Chapter 1 Introduction


This guide provides optimization information and recommendations for the AMD Athlon™ 64 and 
AMD Opteron™ processors. These optimizations are designed to yield software code that is fast, 
compact, and efficient. Toward this end, the optimizations in each of the following chapters are listed 
in order of importance. 

This chapter covers the following topics: 


Topic                      Page
Intended Audience          1
Getting Started Quickly    1
Using This Guide           2
Important New Terms        4
Key Optimizations          6


1.1 Intended Audience 

This book is intended for compiler and assembler designers, as well as C, C++, and assembly- 
language programmers writing performance-sensitive code sequences. This guide assumes that you 
are familiar with the AMD64 instruction set and the AMD64 architecture (registers and programming 
modes). For complete information on the AMD64 architecture and instruction set, see the 
multivolume AMD64 Architecture Programmer's Manual available from AMD.com. Documentation
volumes and their order numbers are provided below. 


Title                                                         Order no.
Volume 1, Application Programming                             24592
Volume 2, System Programming                                  24593
Volume 3, General-Purpose and System Instructions             24594
Volume 4, 128-Bit Media Instructions                          26568
Volume 5, 64-Bit Media and x87 Floating-Point Instructions    26569


1.2 Getting Started Quickly 

More experienced readers may skip to “Key Optimizations” on page 6, which identifies the most 
important optimizations. 


1.3 Using This Guide

This chapter explains how to get the most benefit from this guide. It defines important new terms you 
will need to understand before reading the rest of this guide and lists the most important optimizations 
by rank. 

Chapter 2 describes techniques that you can use to optimize your C and C++ source code. The 
“Application” section for each optimization indicates whether the optimization applies to 32-bit 
software, 64-bit software, or both. 

Chapter 3 presents general assembly-language optimizations that improve the performance of 
software designed to run in 64-bit mode. All optimizations in this chapter apply only to 64-bit 
software. 

The remaining chapters describe assembly-language optimizations. The “Application” section under
each optimization indicates whether the optimization applies to 32-bit software, 64-bit software, or
both.

Chapter 4     Instruction-Decoding Optimizations
Chapter 5     Cache and Memory Optimizations
Chapter 6     Branch Optimizations
Chapter 7     Scheduling Optimizations
Chapter 8     Integer Optimizations
Chapter 9     Optimizing with SIMD Instructions
Chapter 10    x87 Floating-Point Optimizations


Appendix A discusses the internal design, or microarchitecture, of the processor and provides 
specifications on the translation-lookaside buffers. It also provides information on other functional 
units that are not part of the main processor but are integrated on the chip. 

Appendix B describes the memory write-combining feature of the processor. 

Appendix C provides a complete listing of all AMD64 instructions. It shows each instruction’s 
encoding, decode type, execution latency, and—where applicable—the pipe used in the floating-point 
unit. 

Appendix D discusses optimizations that improve the throughput of AGP transfers. 

Appendix E describes coding practices that improve performance when using SSE and SSE2 
instructions. 
Special Information

Special information in this guide looks like this: 

This symbol appears next to the most important, or key, optimizations. 

Numbering Systems 

The following suffixes identify different numbering systems: 


This suffix    Identifies a
b              Binary number. For example, the binary equivalent of the number 5 is written 101b.
d              Decimal number. Decimal numbers are followed by this suffix only when the possibility of
               confusion exists. In general, decimal numbers are shown without a suffix.
h              Hexadecimal number. For example, the hexadecimal equivalent of the number 60 is
               written 3Ch.


Typographic Notation 

This guide uses the following typographic notations for certain types of information: 


This type of text    Identifies
italic               Placeholders that represent information you must provide. Italicized text is also used
                     for the titles of publications and for emphasis.
monowidth            Program statements and function names.


Providing Feedback 

If you have suggestions for improving this guide, we would like to hear from you. Please send your 
comments to the following e-mail address: 

code.optimization@amd.com


1.4 Important New Terms

This section defines several important terms and concepts used in this guide. 

Primitive Operations 

AMD Athlon 64 and AMD Opteron processors perform four types of primitive operations:

• Integer (arithmetic or logic) 

• Floating-point (arithmetic) 

• Load 

• Store 

Internal Instruction Formats 

The AMD64 instruction set is complex; instructions have variable-length encodings and many 
perform multiple primitive operations. AMD Athlon 64 and AMD Opteron processors do not execute 
these complex instructions directly, but, instead, decode them internally into simpler fixed-length 
instructions called macro-ops. Processor schedulers subsequently break down macro-ops into
sequences of even simpler instructions called micro-ops, each of which specifies a single primitive
operation.

A macro-op is a fixed-length instruction that: 

• Expresses, at most, one integer or floating-point operation and one load and/or store operation. 

• Is the primary unit of work managed (that is, dispatched and retired) by the processor. 

A micro-op is a fixed-length instruction that: 

• Expresses one and only one of the primitive operations that the processor can perform (for 
example, a load). 

• Is executed by the processor’s execution units. 
Table 1 summarizes the differences between AMD64 instructions, macro-ops, and micro-ops.


Table 1. Instructions, Macro-ops and Micro-ops

Comparing            AMD64 instructions             Macro-ops                      Micro-ops

Complexity           Complex                        Average                        Simple
                     A single instruction may       A single macro-op may          A single micro-op
                     specify one or more of         specify, at most, one          specifies only one of the
                     each of the following          integer or floating-point      following primitive
                     operations:                    operation and one of the       operations:
                     • Integer or floating-point    following operations:          • Integer or floating-point
                       operation                    • Load                         • Load
                     • Load                         • Store                        • Store
                     • Store                        • Load and store to the
                                                      same address

Encoded length       Variable (instructions are     Fixed (all macro-ops are       Fixed (all micro-ops are
                     different lengths)             the same length)               the same length)

Regularized          No (field locations and        Yes (field locations and       Yes (field locations and
instruction fields   definitions vary among         definitions are the same       definitions are the same
                     instructions)                  for all macro-ops)             for all micro-ops)


Types of Instructions 

Instructions are classified according to how they are decoded by the processor. There are three types 
of instructions: 


Instruction Type     Description
DirectPath Single    A relatively common instruction that the processor decodes directly into one macro-op
                     in hardware.
DirectPath Double    A relatively common instruction that the processor decodes directly into two
                     macro-ops in hardware.
VectorPath           A sophisticated or less common instruction that the processor decodes into one or
                     more (usually three or more) macro-ops using the on-chip microcode-engine ROM
                     (MROM).


1.5 Key Optimizations

While all of the optimizations in this guide help improve software performance, some of them have 
more impact than others. Optimizations that offer the most improvement are called key optimizations. 

Guideline 

Concentrate your efforts on implementing key optimizations before moving on to other optimizations, 
and incorporate higher-ranking key optimizations first. 

Key Optimizations by Rank 

Table 2 lists the key optimizations by rank.

Table 2. Optimizations by Rank

Rank    Optimization                                                             Page
1       Memory-Size Mismatches                                                   92
2       Natural Alignment of Data Objects                                        95
3       Memory Copy                                                              120
4       Density of Branches                                                      126
5       Prefetch Instructions                                                    104
6       Two-Byte Near-Return RET Instruction                                     128
7       DirectPath Instructions                                                  72
8       Load-Execute Integer Instructions                                        73
9       Load-Execute Floating-Point Instructions with Floating-Point Operands    74
10      Load-Execute Floating-Point Instructions with Integer Operands           74
11      Write-combining                                                          113
12      Branches That Depend on Random Data                                      130
13      Half-Register Operations                                                 356
14      Placing Code and Data in the Same 64-Byte Cache Line                     116
Chapter 2 C and C++ Source-Level Optimizations


Although C and C++ compilers generally produce very compact object code, many performance 
improvements are possible by careful source code optimization. Most such optimizations result from 
taking advantage of the underlying mechanisms used by C and C++ compilers to translate source 
code into sequences of AMD64 instructions. This chapter includes guidelines for writing C and C++
source code that results in the most efficiently optimized AMD64 code.

This chapter covers the following topics: 


Topic                                                       Page
Declarations of Floating-Point Values                       9
Using Arrays and Pointers                                   10
Unrolling Small Loops                                       13
Expression Order in Compound Branch Conditions              14
Long Logical Expressions in If Statements                   16
Arrange Boolean Operands for Quick Expression Evaluation    17
Dynamic Memory Allocation Consideration                     19
Unnecessary Store-to-Load Dependencies                      20
Matching Store and Load Size                                22
SWITCH and Noncontiguous Case Expressions                   25
Arranging Cases by Probability of Occurrence                28
Use of Function Prototypes                                  29
Use of const Type Qualifier                                 30
Generic Loop Hoisting                                       31
Local Static Functions                                      34
Explicit Parallelism in Code                                35
Extracting Common Subexpressions                            37
Sorting and Padding C and C++ Structures                    39
Sorting Local Variables                                     41
Replacing Integer Division with Multiplication              43
Frequently Dereferenced Pointer Arguments                   44
Array Indices                                               46
32-Bit Integral Data Types                                  47
Sign of Integer Operands                                    48
Accelerating Floating-Point Division and Square Root        50
Fast Floating-Point-to-Integer Conversion                   52
Speeding Up Branches Based on Comparisons Between Floats    54
2.1 Declarations of Floating-Point Values

Optimization 

When working with single precision (float) values: 

• Use the f or F suffix (for example, 3.14f) to specify a constant value of type float.

• Use function prototypes for all functions that accept arguments of type float. 

Application 

This optimization applies to: 

• 32-bit software 

• 64-bit software 

Rationale 

C and C++ compilers treat floating-point constants and arguments as double precision (double) 
unless you specify otherwise. However, single precision floating-point values occupy half the
memory space of double precision values and can often provide the precision necessary for a given
computational problem.
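
As a minimal sketch of both recommendations, the following C fragment declares a prototype with float parameters and uses f-suffixed constants so that the arithmetic stays in single precision. The function names (scale, half_of, and the two size helpers) and the values are illustrative assumptions, not part of this guide.

```c
#include <stddef.h>

/* Prototype: tells the compiler that both arguments are float,
   so callers do not promote them to double. */
float scale(float value, float factor);

float scale(float value, float factor)
{
    return value * factor;
}

/* 0.5f is a float constant; without the suffix, 0.5 would be a double. */
float half_of(float value)
{
    return scale(value, 0.5f);
}

/* Size in bytes of a suffixed and of an unsuffixed constant. */
size_t float_constant_size(void)  { return sizeof(3.14f); }
size_t double_constant_size(void) { return sizeof(3.14); }
```

The two size helpers make the difference visible: the suffixed constant has the size of a float, while the unsuffixed one has the size of a double.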


2.2 Using Arrays and Pointers

Optimization 

Use array notation instead of pointer notation when working with arrays. 

Application 

This optimization applies to: 

• 32-bit software 

• 64-bit software 

Rationale 

C allows the use of either the array operator ([]) or pointers to access the elements of an array. 
However, the use of pointers in C makes work difficult for optimizers in C compilers. Without 
detailed and aggressive pointer analysis, the compiler has to assume that writes through a pointer can 
write to any location in memory, including storage allocated to other variables. (For example, *p and 
*q can refer to the same memory location, while x[0] and x[2] cannot.) Using pointers causes
aliasing, where the same block of memory is accessible in more than one way. Using array notation 
makes the task of the optimizer easier by reducing possible aliasing. 


Example

Avoid code, such as the following, which uses pointer notation: 

typedef struct {
    float x, y, z, w;
} VERTEX;

typedef struct {
    float m[4][4];
} MATRIX;

void XForm(float *res, const float *v, const float *m, int numverts) {
    float dp;
    int i;
    const VERTEX* vv = (VERTEX *)v;

    for (i = 0; i < numverts; i++) {
        dp  = vv->x * *m++;
        dp += vv->y * *m++;
        dp += vv->z * *m++;
        dp += vv->w * *m++;
        *res++ = dp;    // Write transformed x.

        dp  = vv->x * *m++;
        dp += vv->y * *m++;
        dp += vv->z * *m++;
        dp += vv->w * *m++;
        *res++ = dp;    // Write transformed y.

        dp  = vv->x * *m++;
        dp += vv->y * *m++;
        dp += vv->z * *m++;
        dp += vv->w * *m++;
        *res++ = dp;    // Write transformed z.

        dp  = vv->x * *m++;
        dp += vv->y * *m++;
        dp += vv->z * *m++;
        dp += vv->w * *m++;
        *res++ = dp;    // Write transformed w.

        ++vv;           // Next input vertex
        m -= 16;        // Reset to start of transform matrix.
    }
}
Instead, use the equivalent array notation:

typedef struct {
    float x, y, z, w;
} VERTEX;

typedef struct {
    float m[4][4];
} MATRIX;

void XForm(float *res, const float *v, const float *m, int numverts) {
    int i;
    const VERTEX* vv = (VERTEX *)v;
    const MATRIX* mm = (MATRIX *)m;
    VERTEX* rr = (VERTEX *)res;

    for (i = 0; i < numverts; i++) {
        rr->x = vv->x * mm->m[0][0] + vv->y * mm->m[0][1] +
                vv->z * mm->m[0][2] + vv->w * mm->m[0][3];
        rr->y = vv->x * mm->m[1][0] + vv->y * mm->m[1][1] +
                vv->z * mm->m[1][2] + vv->w * mm->m[1][3];
        rr->z = vv->x * mm->m[2][0] + vv->y * mm->m[2][1] +
                vv->z * mm->m[2][2] + vv->w * mm->m[2][3];
        rr->w = vv->x * mm->m[3][0] + vv->y * mm->m[3][1] +
                vv->z * mm->m[3][2] + vv->w * mm->m[3][3];
        ++rr;    // Increment the results pointer.
        ++vv;    // Increment the input vertex pointer.
    }
}

Additional Considerations 

Source-code transformations interact with a compiler’s code generator, making it difficult to control 
the generated machine code from the source level. It is even possible that source-code transformations 
aimed at improving performance may conflict with compiler optimizations. Depending on the 
compiler and the specific source code, it is possible for pointer-style code to compile into machine 
code that is faster than that generated from equivalent array-style code. Compare the performance of 
your code after implementing a source-code transformation with the performance of the original code 
to be sure that there is an improvement. 
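
One way to make that comparison is a small timing harness. The sketch below is a hedged illustration only: kernel_fn, scale_ptr, scale_arr, and time_kernel are hypothetical names introduced here, the kernels are deliberately trivial, and the standard clock() function has coarse resolution, so a profiler or many repetitions may be needed for meaningful numbers.

```c
#include <time.h>

/* Hypothetical kernel signature used only for this sketch. */
typedef void (*kernel_fn)(float *out, const float *in, int n);

/* Pointer-style kernel: doubles each element. */
void scale_ptr(float *out, const float *in, int n)
{
    while (n-- > 0)
        *out++ = *in++ * 2.0f;
}

/* Array-style kernel: the same computation with index notation. */
void scale_arr(float *out, const float *in, int n)
{
    int i;
    for (i = 0; i < n; i++)
        out[i] = in[i] * 2.0f;
}

/* Returns the CPU time, in seconds, consumed by reps calls of fn. */
double time_kernel(kernel_fn fn, float *out, const float *in, int n, int reps)
{
    clock_t start = clock();
    int r;
    for (r = 0; r < reps; r++)
        fn(out, in, n);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Call time_kernel once per variant with the same input and compare both the returned times and the output buffers, so that a faster variant that computes different results is not accepted by mistake.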



25112 Rev. 3.06 September 2005 



Software Optimization Guide for AMD64 Processors 


2.3 Unrolling Small Loops 

Optimization 

Completely unroll loops that have a small fixed loop count and a small loop body. 


Application 

This optimization applies to: 

• 32-bit software 

• 64-bit software 


Rationale 

Many compilers do not aggressively unroll loops. Manually unrolling loops can benefit performance, 
especially if the loop body is small, which makes the loop overhead significant. 


Example 

Avoid a small loop like this: 

// 3D-transform: Multiply vector V by 4x4 transform matrix M.
for (i = 0; i < 4; i++) {
    r[i] = 0;
    for (j = 0; j < 4; j++) {
        r[i] += m[j][i] * v[j];
    }
}

Instead, replace it with its completely unrolled equivalent, as shown here: 


r[0] = m[0][0] * v[0] + m[1][0] * v[1] + m[2][0] * v[2] + m[3][0] * v[3];
r[1] = m[0][1] * v[0] + m[1][1] * v[1] + m[2][1] * v[2] + m[3][1] * v[3];
r[2] = m[0][2] * v[0] + m[1][2] * v[1] + m[2][2] * v[2] + m[3][2] * v[3];
r[3] = m[0][3] * v[0] + m[1][3] * v[1] + m[2][3] * v[2] + m[3][3] * v[3];


Related information 

For information on loop unrolling at the assembly-language level, see “Loop Unrolling” on page 145. 
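
When unrolling by hand, it is easy to mistype an index. The following sketch wraps the looped and unrolled forms of the 4x4 transform in two functions so that their results can be compared on test data; the function names xform_loop and xform_unrolled are assumptions made for this illustration.

```c
/* Looped form: r[i] is the sum over j of m[j][i] * v[j]. */
void xform_loop(float r[4], float m[4][4], const float v[4])
{
    int i, j;
    for (i = 0; i < 4; i++) {
        r[i] = 0;
        for (j = 0; j < 4; j++)
            r[i] += m[j][i] * v[j];
    }
}

/* Completely unrolled form of the same computation. */
void xform_unrolled(float r[4], float m[4][4], const float v[4])
{
    r[0] = m[0][0] * v[0] + m[1][0] * v[1] + m[2][0] * v[2] + m[3][0] * v[3];
    r[1] = m[0][1] * v[0] + m[1][1] * v[1] + m[2][1] * v[2] + m[3][1] * v[3];
    r[2] = m[0][2] * v[0] + m[1][2] * v[1] + m[2][2] * v[2] + m[3][2] * v[3];
    r[3] = m[0][3] * v[0] + m[1][3] * v[1] + m[2][3] * v[2] + m[3][3] * v[3];
}
```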
2.4 Expression Order in Compound Branch Conditions

Optimization 

In the most active areas of a program, order the expressions in compound branch conditions to take 
advantage of short circuiting of compound conditional expressions. 

Application 

This optimization applies to: 

• 32-bit software 

• 64-bit software 

Rationale 

Branch conditions in C programs are often compound conditions consisting of multiple
boolean expressions joined by the logical AND (&&) and logical OR (||) operators. C compilers
guarantee short-circuit evaluation of these operators. In a compound logical OR expression, the first 
operand to evaluate to true terminates the evaluation, and subsequent operands are not evaluated at all. 
Similarly, in a logical AND expression, the first operand to evaluate to false terminates the evaluation. 
Because of this short-circuit evaluation, it is not always possible to swap the operands of logical OR 
and logical AND. This is especially true when the evaluation of one of the operands causes a side 
effect. However, in most cases the order of operands in such expressions is irrelevant. 

When used to control conditional branches, expressions involving logical OR and logical AND are 
translated into a series of conditional branches. The ordering of the conditional branches is a function 
of the ordering of the expressions in the compound condition and can have a significant impact on 
performance. It is impossible to give an easy, closed-form formula for ordering the conditions.
Overall performance depends on a variety of factors, including the following:

• Probability of a branch misprediction for each of the branches generated 

• Additional latency incurred due to a branch misprediction 

• Cost of evaluating the conditions controlling each of the branches generated 

• Amount of parallelism that can be extracted in evaluating the branch conditions 

• Data stream consumed by an application (mostly due to the dependence of misprediction 
probabilities on the nature of the incoming data in data-dependent branches) 

It is recommended to experiment with the ordering of expressions in compound branch conditions in
the most active areas of a program (so-called “hot spots,” where most of the execution time is spent).
Such hot spots can be found by profiling. Feed a typical data stream to the program while doing the
experiments.
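
The kind of experiment described above can be sketched as two functions that test the same compound condition with the operands in different orders. The RECORD type, its fields, and the string "target" are hypothetical and exist only for this illustration; which ordering is faster depends on the data and on the factors listed above, so both should be measured on a typical data stream.

```c
#include <string.h>

/* Hypothetical record type, used only for illustration. */
typedef struct {
    int flags;
    const char *name;
} RECORD;

/* String comparison first: strcmp() runs for every record. */
int match_name_first(const RECORD *r)
{
    return (strcmp(r->name, "target") == 0) && ((r->flags & 1) != 0);
}

/* Flag test first: strcmp() runs only when the flag is set. */
int match_flag_first(const RECORD *r)
{
    return ((r->flags & 1) != 0) && (strcmp(r->name, "target") == 0);
}
```

Neither operand has a side effect, so both functions return the same result for every record and only their cost differs.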
2.5 Long Logical Expressions in If Statements

Optimization 

In if statements, avoid long logical expressions that can generate dense conditional branches that 
violate the guideline described in “Density of Branches” on page 126. 

Application 

This optimization applies to: 

• 32-bit software 

• 64-bit software 

Rationale 

Listing 1. Preferred for Data that Falls Mostly Within the Range 

if (a <= max && a >= min && b <= max && b >= min) 

If most of the data falls within the range, the branches will not be taken, so the above code is 
preferred. Otherwise, the following code is preferred. 

Listing 2. Preferred for Data that Does Not Fall Mostly Within the Range 

if (a > max || a < min || b > max || b < min)
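
Note that the conditions in Listing 1 and Listing 2 are logical complements of each other, so switching from one form to the other also means exchanging the bodies of the if and else branches. The following sketch, with hypothetical function names, expresses both forms so that this relationship can be checked.

```c
/* Listing 1 form: true when both a and b lie within [min, max]. */
int both_in_range(int a, int b, int min, int max)
{
    return a <= max && a >= min && b <= max && b >= min;
}

/* Listing 2 form: true when a or b lies outside [min, max]. */
int any_out_of_range(int a, int b, int min, int max)
{
    return a > max || a < min || b > max || b < min;
}
```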
2.6 Arrange Boolean Operands for Quick Expression Evaluation

Optimization 

In expressions that use the logical AND (&&) or logical OR (||) operator, arrange the operands for
quick evaluation of the expression:

If the expression uses this operator    Then arrange the operands from left to right in decreasing
                                        probability of being
&& (logical AND)                        False
|| (logical OR)                         True


Application 

This optimization applies to: 

• 32-bit software 

• 64-bit software 

Rationale 

C and C++ compilers guarantee short-circuit evaluation of the boolean operators && and ||. In an
expression that uses &&, the first operand to evaluate to false terminates the evaluation; subsequent
operands are not evaluated. In an expression that uses ||, the first operand to evaluate to true terminates
the evaluation.

When used to control program flow, expressions involving && and || are translated into a series of
conditional branches. This optimization minimizes the total number of conditions evaluated and
branches executed.

Example 1 

In the following code, the operands of && are not arranged for quick expression evaluation because the
first operand is not the condition most likely to be false (an animal name is far more likely not to
begin with a ‘y’ than to have four or fewer characters):

char animalname[30]; 
char *p; 

p = animalname; 

if ((strlen(p) > 4) && (*p == 'y')) { ... } 
Because the odds that the animal name begins with a ‘y’ are comparatively low, it is better to put that
operand first:

if ((*p == 'y') && (strlen(p) > 4)) { ... }

Example 2 

In the following code (assuming a uniform random distribution of i), the operands of || are not
arranged for quick expression evaluation because the first operand is not the condition most likely to 
be true: 

unsigned int i; 

if ((i < 4) || (i & 1)) { ... } 

Because it is more likely for the least-significant bit of i to be 1 than for i to be less than 4, it is better to put that operand first:

if ((i & 1) || (i < 4)) { ... } 
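
The effect of the operand order in Example 2 can be made visible by counting how many operands are actually evaluated. The sketch below is an illustration only: the helper functions and the evals counter are hypothetical, and the loop over consecutive values of i stands in for a uniform distribution.

```c
static unsigned long evals;   /* number of operand evaluations so far */

static int less_than_4(unsigned int i) { evals++; return i < 4; }
static int is_odd(unsigned int i)      { evals++; return (i & 1) != 0; }

/* Operand evaluations for (i < 4) || (i & 1) over i = 0 .. n-1. */
unsigned long count_less_first(unsigned int n)
{
    unsigned int i;
    evals = 0;
    for (i = 0; i < n; i++)
        (void)(less_than_4(i) || is_odd(i));
    return evals;
}

/* Operand evaluations for (i & 1) || (i < 4) over i = 0 .. n-1. */
unsigned long count_odd_first(unsigned int n)
{
    unsigned int i;
    evals = 0;
    for (i = 0; i < n; i++)
        (void)(is_odd(i) || less_than_4(i));
    return evals;
}
```

For n = 1024, the first ordering evaluates 2044 operands and the second evaluates 1536, because the odd test short-circuits the expression for half of all values.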
2.7 Dynamic Memory Allocation Consideration

Dynamic memory allocation—accomplished through the use of the malloc library function in C— 
should always return a pointer that is suitably aligned for the largest base type (quadword alignment). 
Where this aligned pointer cannot be guaranteed, use the technique shown in the following code to 
make the pointer quadword aligned, if needed. This code assumes that it is possible to cast the pointer 
to a long. 

double *p;
double *np;

p = (double *)malloc(sizeof(double) * number_of_doubles + 7L);
np = (double *)((((long)(p)) + 7L) & (-8L));

Then use np instead of p to access the data. The pointer p is still needed in order to deallocate the 
storage. 
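
On platforms where a pointer does not fit in a long, the cast above loses address bits. A hedged alternative, assuming a C99 compiler that provides the optional uintptr_t type in <stdint.h>, is sketched below; the helper name alloc_aligned_doubles and its raw output parameter are hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>

/* Allocates room for count doubles and returns a quadword-aligned pointer.
   The unadjusted pointer is stored in *raw and must be passed to free(). */
double *alloc_aligned_doubles(size_t count, void **raw)
{
    void *p = malloc(sizeof(double) * count + 7);

    *raw = p;
    if (p == NULL)
        return NULL;

    /* Round the address up to the next multiple of 8. */
    return (double *)(((uintptr_t)p + 7) & ~(uintptr_t)7);
}
```

As with p and np above, use the returned pointer to access the data and keep the raw pointer only for deallocation.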

Application 

This optimization applies to: 

• 32-bit software 

• 64-bit software 
2.8 Unnecessary Store-to-Load Dependencies

A store-to-load dependency exists when data is stored to memory, only to be read back shortly 
thereafter. For details, see “Store-to-Load Forwarding Restrictions” on page 100. The 
AMD Athlon™ 64 and AMD Opteron™ processors contain hardware to accelerate such store-to-load 
dependencies, allowing the load to obtain the store data before it has been written to memory. 
However, it is still faster to avoid such dependencies altogether and keep the data in an internal 
register. 

Avoiding store-to-load dependencies is especially important if they are part of a long dependency 
chain, as may occur in a recurrence computation. If the dependency occurs while operating on arrays, 
many compilers are unable to optimize the code in a way that avoids the store-to-load dependency. In 
some instances the language definition may prohibit the compiler from using code transformations 
that would remove the store-to-load dependency. Therefore, it is recommended that the programmer 
remove the dependency manually, for example, by introducing a temporary variable that can be kept 
in a register, as in the following example. This can result in a significant performance increase. 

Listing 3. Avoid 

double x[VECLEN], y[VECLEN], z[VECLEN];
unsigned int k;

for (k = 1; k < VECLEN; k++) {
    x[k] = x[k-1] + y[k];
}

for (k = 1; k < VECLEN; k++) {
    x[k] = z[k] * (y[k] - x[k-1]);
}

Listing 4. Preferred 

double x[VECLEN], y[VECLEN], z[VECLEN];
unsigned int k;
double t;

t = x[0];
for (k = 1; k < VECLEN; k++) {
    t = t + y[k];
    x[k] = t;
}

t = x[0];
for (k = 1; k < VECLEN; k++) {
    t = z[k] * (y[k] - t);
    x[k] = t;
}
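
The first loop of each listing can be packaged as a pair of functions to confirm that the transformation preserves the results. This is a sketch under stated assumptions: the function names and the pointer-plus-length interface are introduced here for illustration. With pointer parameters the compiler generally cannot prove that x and y do not overlap, which is one reason it may not perform this transformation itself.

```c
/* Listing 3 style: each iteration reloads x[k-1], which was just stored. */
void prefix_sum_store(double *x, const double *y, unsigned int n)
{
    unsigned int k;
    for (k = 1; k < n; k++)
        x[k] = x[k-1] + y[k];
}

/* Listing 4 style: the running value stays in the temporary t. */
void prefix_sum_temp(double *x, const double *y, unsigned int n)
{
    unsigned int k;
    double t;

    if (n == 0)
        return;          /* nothing to read from x[0] */

    t = x[0];
    for (k = 1; k < n; k++) {
        t = t + y[k];
        x[k] = t;
    }
}
```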
Application

This optimization applies to: 

• 32-bit software 

• 64-bit software 
2.9 Matching Store and Load Size

Optimization 

Align memory accesses and match addresses and sizes of stores and dependent loads. 

Application 

This optimization applies to: 

• 32-bit software 

• 64-bit software 

Rationale 

The AMD Athlon 64 and AMD Opteron processors contain a load-store buffer to speed up the 
forwarding of store data to dependent loads. However, this store-to-load forwarding (STLF) inside the 
load-store buffer occurs, in general, only when the addresses and sizes of the store and the dependent 
load match, and when both memory accesses are aligned. For details, see “Store-to-Load Forwarding 
Restrictions” on page 100. 

It is impossible to control load and store activity at the source level so as to avoid all cases that violate 
restrictions placed on store-to-load-forwarding. In some instances it is possible to spot such cases in 
the source code. Size mismatches can easily occur when different-size data items are joined in a 
union. Address mismatches could be the result of pointer manipulation. 

The following examples show a situation involving a union of different-size data items. The examples
show a user-defined unsigned 16.16 fixed-point type and two operations defined on this type.
Function fixed_add adds two fixed-point numbers, and function fixed_int extracts the integer
portion of a fixed-point number. Listing 5 shows an inappropriate implementation of fixed_int,
which, when used on the result of fixed_add, causes misalignment, address mismatch, or size
mismatch between memory operands, such that no store-to-load forwarding in the load-store buffer
takes place. Listing 6 shows how to properly implement fixed_int in order to allow store-to-load
forwarding in the load-store buffer.

Examples 

Listing 5. Avoid 

typedef union {
    unsigned int whole;
    struct {
        unsigned short frac;    /* Lower 16 bits are fraction. */
        unsigned short intg;    /* Upper 16 bits are integer. */
    } parts;
} FIXED_U_16_16;

__inline FIXED_U_16_16 fixed_add(FIXED_U_16_16 x, FIXED_U_16_16 y) {
    FIXED_U_16_16 z;
    z.whole = x.whole + y.whole;
    return (z);
}

__inline unsigned int fixed_int(FIXED_U_16_16 x) {
    return ((unsigned int)(x.parts.intg));
}

FIXED_U_16_16 y, z;
unsigned int q;

label1:
    y = fixed_add(y, z);
    q = fixed_int(y);
label2:

The object code generated for the source code between label1 and label2 typically follows one of
these two variants:

; Variant 1
mov   edx, DWORD PTR [z]
mov   eax, DWORD PTR [y]      ; -+
add   eax, edx                ;  |
mov   DWORD PTR [y], eax      ;  |
mov   EAX, DWORD PTR [y+2]    ; <+ Address mismatch--no forwarding in LSU
and   EAX, 0FFFFh
mov   DWORD PTR [q], eax

; Variant 2
mov   edx, DWORD PTR [z]
mov   eax, DWORD PTR [y]      ; -+
add   eax, edx                ;  |
mov   DWORD PTR [y], eax      ;  |
movzx eax, WORD PTR [y+2]     ; <+ Size and address mismatch--no forwarding in LSU
mov   DWORD PTR [q], eax

Listing 6. Preferred 

typedef union {
    unsigned int whole;
    struct {
        unsigned short frac;    /* Lower 16 bits are fraction. */
        unsigned short intg;    /* Upper 16 bits are integer. */
    } parts;
} FIXED_U_16_16;

__inline FIXED_U_16_16 fixed_add(FIXED_U_16_16 x, FIXED_U_16_16 y) {
    FIXED_U_16_16 z;
    z.whole = x.whole + y.whole;
    return (z);
}

__inline unsigned int fixed_int(FIXED_U_16_16 x) {
    return (x.whole >> 16);
}

FIXED_U_16_16 y, z;
unsigned int q;

label1:
    y = fixed_add(y, z);
    q = fixed_int(y);
label2:

The object code generated for the source code between label1 and label2 typically looks like this:

mov   edx, DWORD PTR [z]
mov   eax, DWORD PTR [y]
add   eax, edx
mov   DWORD PTR [y], eax      ; -+
mov   eax, DWORD PTR [y]      ; <+ Aligned (size/address match)--forwarding in LSU
shr   eax, 16
mov   DWORD PTR [q], eax
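
A self-contained version of the preferred approach can be exercised with known values, as sketched below. This is an illustration under assumptions: unsigned int is taken to be 32 bits wide, plain functions replace the compiler-specific inline keyword for portability, and fixed_from_parts is a hypothetical helper added only to build test values.

```c
typedef union {
    unsigned int whole;
    struct {
        unsigned short frac;    /* Lower 16 bits are fraction. */
        unsigned short intg;    /* Upper 16 bits are integer. */
    } parts;
} FIXED_U_16_16;

/* Hypothetical helper: builds a 16.16 value from its two halves. */
FIXED_U_16_16 fixed_from_parts(unsigned int intg, unsigned int frac)
{
    FIXED_U_16_16 z;
    z.whole = (intg << 16) | (frac & 0xFFFFu);
    return z;
}

FIXED_U_16_16 fixed_add(FIXED_U_16_16 x, FIXED_U_16_16 y)
{
    FIXED_U_16_16 z;
    z.whole = x.whole + y.whole;    /* 32-bit store */
    return z;
}

/* Reads the same 32-bit member that fixed_add wrote, then shifts. */
unsigned int fixed_int(FIXED_U_16_16 x)
{
    return x.whole >> 16;
}
```

Because fixed_int reads the whole member rather than the 16-bit intg member, the load matches the preceding store in both size and address.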
2.10 SWITCH and Noncontiguous Case Expressions

Optimization 

Use if-else statements in place of switch statements that have noncontiguous case expressions. 
(Case expressions are the individual expressions to which the single switch expression is compared.) 

Application 

This optimization applies to: 

• 32-bit software 

• 64-bit software 

Rationale 

If the case expressions are contiguous or nearly contiguous integer values, most compilers translate 
the switch statement as a jump table instead of a comparison chain. Jump tables generally improve 
performance because: 

• They reduce the number of branches to a single indirect jump.

• The size of the control-flow code is the same no matter how many cases there are. 

• The amount of control-flow code that the processor must execute is the same for all values of the 
switch expression. 

However, if the case expressions are noncontiguous values, most compilers translate the switch 
statement as a comparison chain. Comparison chains are undesirable because: 

• They use dense sequences of conditional branches, which interfere with the processor’s ability to 
successfully perform branch prediction. 

• The size of the control-flow code increases with the number of cases. 

• The amount of control-flow code that the processor must execute varies with the value of the 
switch expression. 
Example 1

A switch statement like this one, whose case expressions are contiguous integer values, usually 
provides good performance: 

switch (grade)
{
    case 'A':
        break;
    case 'B':
        break;
    case 'C':
        break;
    case 'D':
        break;
    case 'F':
        break;
}

Example 2 

Because the case expressions in the following switch statement are not contiguous values, the 
compiler will likely translate the code into a comparison chain instead of a jump table: 

switch (a)
{
    case 8:
        // Sequence for a==8
        break;
    case 16:
        // Sequence for a==16
        break;
    default:
        // Default sequence
        break;
}
To avoid a comparison chain and its undesirable effects on branch prediction, replace the switch
statement with a series of if-else statements, as follows: 

if (a==8) {
    // Sequence for a==8
}
else if (a==16) {
    // Sequence for a==16
}
else {
    // Default sequence
}
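
When converting a switch statement this way, it is worth confirming that the if-else form selects the same sequence for every input, including values that fall through to the default. The sketch below does this with two hypothetical functions whose return codes (1, 2, and 0) stand in for the three sequences.

```c
/* Original form: switch with noncontiguous case expressions. */
int classify_switch(int a)
{
    switch (a) {
        case 8:  return 1;    /* Sequence for a==8 */
        case 16: return 2;    /* Sequence for a==16 */
        default: return 0;    /* Default sequence */
    }
}

/* Converted form: explicit if-else statements. */
int classify_if(int a)
{
    if (a == 8)
        return 1;             /* Sequence for a==8 */
    else if (a == 16)
        return 2;             /* Sequence for a==16 */
    else
        return 0;             /* Default sequence */
}
```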

Related Information 

For information on preventing branch-prediction interference at the assembly-language level, see 
“Density of Branches” on page 126. 
2.11 Arranging Cases by Probability of Occurrence

Optimization 

Arrange switch statement cases by probability of occurrence, from most probable to least probable. 

Application 

This optimization applies to: 

• 32-bit software 

• 64-bit software 

Rationale 

Arranging switch statement cases by probability of occurrence improves performance when the 
switch statement is translated as a comparison chain; this arrangement has no negative impact when 
the statement is translated as a jump table. 

Example 

Avoid switch statements such as the following, in which the cases are not arranged by probability of 
occurrence: 

int days_in_month, short_months, normal_months, long_months; 

switch (days_in_month) { 
case 28: 

case 29: short_months++; break; 
case 30: normal_months++; break; 
case 31: long_months++; break; 

default: printf("Month has fewer than 28 or more than 31 days.\n"); 

} 

Instead, arrange the cases to test for frequently occurring values first: 

switch (days_in_month) { 

case 31: long_months++; break; 
case 30: normal_months++; break; 
case 28: 

case 29: short_months++; break; 

default: printf("Month has fewer than 28 or more than 31 days.\n"); 

} 

2.12 Use of Function Prototypes 

Optimization 

In general, use prototypes for all functions. 

Application 

This optimization applies to: 

• 32-bit software 

• 64-bit software 

Rationale 

Prototypes can convey additional information to the compiler that might enable more aggressive 
optimizations. 

2.13 Use of const Type Qualifier 

Optimization 

For objects whose values will not be changed, use the const type qualifier. 

Application 

This optimization applies to: 

• 32-bit software 

• 64-bit software 

Rationale 

Using the const type qualifier makes code more robust and may enable the compiler to generate 
higher-performance code. For example, under the C standard, a compiler is not required to allocate 
storage for an object that is declared const, if its address is never used. 

2.14 Generic Loop Hoisting 

Optimization 

To improve the performance of inner loops, reduce redundant constant calculations (that is, loop- 
invariant calculations). This idea can also be extended to invariant control structures. 

Application 

This optimization applies to: 

• 32-bit software 

• 64-bit software 

Rationale and Examples 

The following example demonstrates the use of an invariant condition in an if statement in a for 
loop. The second listing shows the preferred optimization. 

Listing 7. (Avoid) 

for (i ...) { 
    if (CONSTANT0) { 
        DoWork0(i); // Does not affect CONSTANT0. 
    } 
    else { 
        DoWork1(i); // Does not affect CONSTANT0. 
    } 
} 

Listing 8. (Preferred Optimization) 

if (CONSTANT0) { 
    for (i ...) { 
        DoWork0(i); 
    } 
} 
else { 
    for (i ...) { 
        DoWork1(i); 
    } 
} 

The preferred optimization in Listing 8 tightens the inner loops by avoiding repetitious evaluation of a 
known if control structure. Although the branch would be easily predicted, the extra instructions and 
decode limitations imposed by branching (which are usually advantageous) are saved. 
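
As a concrete sketch of Listings 7 and 8, the following C functions either double or increment every element of an array, depending on a flag that never changes inside the loop. The function names (apply_unhoisted, apply_hoisted) and the two operations are hypothetical and chosen only for illustration; both versions are intended to produce the same result.

```c
#include <stddef.h>

/* Unhoisted: the loop-invariant flag is tested on every iteration. */
void apply_unhoisted(int *v, size_t n, int use_double)
{
    for (size_t i = 0; i < n; i++) {
        if (use_double)
            v[i] *= 2;      /* Does not modify use_double. */
        else
            v[i] += 1;      /* Does not modify use_double. */
    }
}

/* Hoisted: the flag is tested once, and neither inner loop contains
   the if statement. */
void apply_hoisted(int *v, size_t n, int use_double)
{
    if (use_double) {
        for (size_t i = 0; i < n; i++)
            v[i] *= 2;
    } else {
        for (size_t i = 0; i < n; i++)
            v[i] += 1;
    }
}
```

The hoisted form duplicates the loop header, which is the code-size cost described above.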

To generalize the example in Listing 8 further for multiple-constant control code, more work may be 
needed to create the proper outer loop. Enumeration of the constant cases reduces this to a simple 
switch statement. 

Listing 9. 

for (i ...) { 
    if (CONSTANT0) { 
        DoWork0(i); // Does not affect CONSTANT0 or CONSTANT1. 
    } 
    else { 
        DoWork1(i); // Does not affect CONSTANT0 or CONSTANT1. 
    } 
    if (CONSTANT1) { 
        DoWork2(i); // Does not affect CONSTANT0 or CONSTANT1. 
    } 
    else { 
        DoWork3(i); // Does not affect CONSTANT0 or CONSTANT1. 
    } 
} 

Transform the loop in Listing 9 (by using the switch statement) into: 

#define combine(c1, c2) (((c1) << 1) + (c2)) 

switch (combine(CONSTANT0 != 0, CONSTANT1 != 0)) { 
case combine(0, 0): 
    for (i ...) { 
        DoWork1(i); 
        DoWork3(i); 
    } 
    break; 
case combine(1, 0): 
    for (i ...) { 
        DoWork0(i); 
        DoWork3(i); 
    } 
    break; 
case combine(0, 1): 
    for (i ...) { 
        DoWork1(i); 
        DoWork2(i); 
    } 
    break; 
case combine(1, 1): 
    for (i ...) { 
        DoWork0(i); 
        DoWork2(i); 
    } 
    break; 
default: 
    break; 
} 

Some introductory code is necessary to generate all the combinations for the switch constant and the 
total amount of code has doubled. However, the inner loops are now free of if statements. In ideal 
cases where the DoWorkn functions are inlined, the successive functions have greater overlap, leading 
to greater parallelism than possible in the presence of intervening if statements. 

The same idea can be applied to constant switch statements or to combinations of switch statements 
and if statements inside of for loops. The method used to combine the input constants becomes 
more complicated but benefits performance. 

However, the number of inner loops can also substantially increase. If the number of inner loops is 
prohibitively high, then only the most common cases must be dealt with directly, and the remaining 
cases can fall back to the old code in the default clause of the switch statement. This situation is 
typical of run-time generated code. While the performance of run-time generated code can be 
improved by means similar to those presented here, it is much harder to maintain and developers must 
do their own code-generation optimizations without the help of an available compiler. 

2.15 Local Static Functions 

Optimization 

Declare as static functions that are not used outside the file where they are defined. 

Application 

This optimization applies to: 

• 32-bit software 

• 64-bit software 

Rationale 

Declaring a function as static forces internal linkage. Functions that are not declared as static 
default to external linkage, which may inhibit certain optimizations—for example, aggressive 
inlining—with some compilers. 
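
A minimal sketch of the idea, using hypothetical function names: the helper square is called only from within its own file, so it is declared static, while sum_of_squares keeps external linkage for callers in other files.

```c
/* Internal helper. static gives it internal linkage, so the compiler
   knows that every caller is visible in this translation unit. */
static int square(int x)
{
    return x * x;
}

/* Public entry point with the default external linkage. */
int sum_of_squares(int a, int b)
{
    return square(a) + square(b);
}
```

Whether the helper is actually inlined still depends on the compiler and its optimization settings.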

2.16 Explicit Parallelism in Code 

Optimization 

Where possible, break long dependency chains into several independent dependency chains that can 
then be executed in parallel, exploiting the execution units in each pipeline. 

Application 

This optimization applies to: 

• 32-bit software 

• 64-bit software 

Rationale and Examples 

It is especially important to break long dependency chains into smaller executing units in 
floating-point code, whether it is mapped to x87, SSE, or SSE2 instructions, because of the longer latency of 
floating-point operations. Because most languages (including ANSI C) guarantee that floating-point 
expressions are not reordered, compilers cannot usually perform such optimizations unless they offer 
a switch to allow noncompliant reordering of floating-point expressions according to algebraic rules. 

Reordered code that is algebraically identical to the original code does not necessarily produce 
identical computational results due to the lack of associativity of floating-point operations. There are 
well-known numerical considerations in applying these optimizations (consult a book on numerical 
analysis). In some cases, these optimizations may lead to unexpected results. In the vast majority of 
cases, the final result differs only in the least-significant bits. 

Listing 10. Avoid 

double a[100], sum; 
int i; 

sum = 0.0; 

for (i = 0; i < 100; i++) { 
    sum += a[i]; 
} 

Listing 11. Preferred 

double a[100], sum1, sum2, sum3, sum4, sum; 
int i; 

sum1 = 0.0; 
sum2 = 0.0; 
sum3 = 0.0; 
sum4 = 0.0; 

for (i = 0; i < 100; i += 4) { 
    sum1 += a[i]; 
    sum2 += a[i + 1]; 
    sum3 += a[i + 2]; 
    sum4 += a[i + 3]; 
} 

sum = (sum4 + sum3) + (sum1 + sum2); 

Notice that the four-way unrolling is chosen to exploit the four-stage fully pipelined floating-point 
adder. Each stage of the floating-point adder is occupied on every clock cycle, ensuring maximum 
sustained utilization. 
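
The two listings can be packaged as functions and compared directly. The sketch below uses hypothetical names (sum_serial, sum_four_way) and assumes, for brevity, that the element count is a multiple of 4. For data whose partial sums are exactly representable the two functions return identical values; for general data the results may differ in the least-significant bits, as noted above.

```c
#include <stddef.h>

/* One dependency chain: each addition waits for the previous one. */
double sum_serial(const double *a, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* Four independent chains. Assumes n is a multiple of 4. */
double sum_four_way(const double *a, size_t n)
{
    double s1 = 0.0, s2 = 0.0, s3 = 0.0, s4 = 0.0;
    for (size_t i = 0; i < n; i += 4) {
        s1 += a[i];
        s2 += a[i + 1];
        s3 += a[i + 2];
        s4 += a[i + 3];
    }
    return (s4 + s3) + (s1 + s2);
}
```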

2.17 Extracting Common Subexpressions 

Optimization 

Manually extract common subexpressions where C compilers may be unable to extract them from 
floating-point expressions due to the guarantee against reordering of such expressions in the ANSI 
standard. 

Application 

This optimization applies to: 

• 32-bit software 

• 64-bit software 

Rationale 

Specifically, the compiler cannot rearrange the computation according to algebraic equivalencies 
before extracting common subexpressions. Rearranging the expression may give different 
computational results due to the lack of associativity of floating-point operations, but the results 
usually differ in only the least-significant bits. 

Examples 

Listing 12. Avoid 

double a, b, c, d, e, f; 

e = b * c / d; 

f = b / d * a; 

Listing 13. Preferred 

double a, b, c, d, e, f, t; 

t=b/d; 
e = c * t ; 
f = a * t ; 

Listing 14. Avoid 

double a, b, c, e, f; 

e = a / c; 
f = b / c; 

Listing 15. Example 2 (Preferred) 

double a, b, c, e, f, t; 

t = 1 / c; 
e = a * t; 
f = b * t; 
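
Listings 12 and 13 can be wrapped in functions to compare their results. This is a sketch with hypothetical function names; it is not taken from any library. With inputs whose intermediate results are exactly representable, both forms agree exactly; with general inputs, e may differ in the least-significant bits because the order of the multiplication and division changes.

```c
/* Listing 12 form: two divisions. */
void scale_two_div(double a, double b, double c, double d,
                   double *e, double *f)
{
    *e = b * c / d;
    *f = b / d * a;
}

/* Listing 13 form: the common subexpression b / d is computed once. */
void scale_one_div(double a, double b, double c, double d,
                   double *e, double *f)
{
    double t = b / d;
    *e = c * t;
    *f = a * t;
}
```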

2.18 Sorting and Padding C and C++ Structures 

Optimization 

Sort and pad C and C++ structures to achieve natural alignment. 

Application 

This optimization applies to: 

• 32-bit software 

• 64-bit software 

Rationale 

In order to achieve better alignment for structures, many compilers have options that allow padding of 
structures to make their sizes multiples of words, doublewords, or quadwords. In addition, to improve 
the alignment of structure members, some compilers might allocate structure elements in an order that 
differs from the order in which they are declared. However, some compilers might not offer any of 
these features, or their implementations might not work properly in all situations. 

By sorting and padding structures at the source-code level, if the first member of a structure is 
naturally aligned, then all other members are naturally aligned as well. This allows, for example, 
arrays of structures to be perfectly aligned. 

Sorting and Padding C and C++ Structures 

To sort and pad a C or C++ structure, follow these steps: 

1. Sort the structure members according to their type sizes, declaring members with larger type sizes 
ahead of members with smaller type sizes. 

2. Pad the structure so the size of the structure is a multiple of the largest member’s type size. 

Examples 

Avoid structure declarations in which the members are not declared in order of their type sizes and the 
size of the structure is not a multiple of the size of the largest member’s type: 

struct { 
    char   a[5];   // Smallest type size (1 byte * 5) 
    long   k;      // 4 bytes in this example 
    double x;      // Largest type size (8 bytes) 
} baz; 

Instead, declare the members according to their type sizes (largest to smallest) and add padding to 
ensure that the size of the structure is a multiple of the largest member’s type size: 

struct { 
    double x;      // Largest type size (8 bytes) 
    long   k;      // 4 bytes in this example 
    char   a[5];   // Smallest type size (1 byte * 5) 
    char   pad[7]; // Make structure size a multiple of 8. 
} baz; 
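
The layout of the sorted structure can be checked with sizeof and offsetof. The sketch below is an illustration under stated assumptions: the structure tags are hypothetical, and int32_t stands in for the 4-byte long of the example because the size of long varies between platforms (it is 8 bytes on 64-bit Linux, for instance).

```c
#include <stddef.h>
#include <stdint.h>

/* Unsorted: the compiler may insert hidden padding before k and x. */
struct baz_unsorted {
    char    a[5];
    int32_t k;      /* Stands in for the 4-byte long of the example. */
    double  x;
};

/* Sorted largest to smallest, with explicit tail padding. */
struct baz_sorted {
    double  x;
    int32_t k;
    char    a[5];
    char    pad[7]; /* 8 + 4 + 5 + 7 = 24, a multiple of 8. */
};
```

In the sorted structure every member starts exactly where the previous one ends, so its size is just the sum of its members' sizes.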




2.19 Sorting Local Variables 

Optimization 

Sort local variables according to their type sizes, declaring those with larger type sizes ahead of those 
with smaller type sizes. 

Application 

This optimization applies to: 

• 32-bit software 

• 64-bit software 

Rationale 

It can be helpful to presort local variables, if your compiler allocates local variables in the same order 
in which they are declared in the source code. If the first variable is allocated for natural alignment, all 
other variables are allocated contiguously in the order they are declared and are naturally aligned 
without padding. 

Some compilers do not allocate variables in the order they are declared. In these cases, the compiler 
should automatically allocate variables that are naturally aligned with the minimum amount of 
padding. In addition, some compilers do not guarantee that the stack is aligned suitably for the largest 
type (that is, they do not guarantee quadword alignment), so that quadword operands might be 
misaligned, even if this technique is used and the compiler does allocate variables in the order they 
are declared. 

Example 

Avoid local variable declarations in which the variables are not declared in order of their type sizes: 

short  ga, gu, gi; 
long   foo, bar; 
double x, y, z[3]; 
char   a, b; 
float  baz; 

Instead, sort the declarations: 

double z[3]; 
double x, y; 
long   foo, bar; 
float  baz; 
short  ga, gu, gi; 
char   a, b; 

Related Information 

For information on sorting local variables at the assembly-language level, see “Sorting Local 
Variables” on page 119. 

2.20 Replacing Integer Division with Multiplication 

Optimization 

Replace integer division with multiplication when there are multiple divisions in an expression. (This 
is possible only if no overflow will occur during the computation of the product. The possibility of an 
overflow can be determined by considering the possible ranges of the divisors.) 

Application 

This optimization applies to: 

• 32-bit software 

• 64-bit software 

Rationale 

Integer division is the slowest of all integer arithmetic operations. 

Examples 

Avoid code that uses two integer divisions: 

int i, j, k, m; 
m = i / j / k; 

Instead, replace one of the integer divisions with the appropriate multiplication: 

m = i / (j * k); 
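
The following sketch wraps both forms in functions so that they can be compared. The function names are hypothetical. The single-division form relies on the caller guaranteeing that j * k does not overflow, which is the range condition stated in the optimization above; the code does not check it.

```c
/* Two divisions. Assumes j != 0 and k != 0. */
int div_twice(int i, int j, int k)
{
    return i / j / k;
}

/* One multiplication and one division.
   Precondition (caller's responsibility): j * k does not overflow
   and is nonzero. */
int div_once(int i, int j, int k)
{
    return i / (j * k);
}
```

Because C integer division truncates toward zero, both functions return the same quotient whenever the precondition holds.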

2.21 Frequently Dereferenced Pointer Arguments 

Optimization 

Avoid frequently dereferencing pointer arguments inside a function. 

Application 

This optimization applies to: 

• 32-bit software 

• 64-bit software 

Rationale 

Because the compiler has no knowledge of whether aliasing exists between the pointers, such 
dereferencing cannot be “optimized away” by the compiler. Since data may not be maintained in 
registers, memory traffic can significantly increase. 

Many compilers have an “assume no aliasing” optimization switch. This allows the compiler to 
assume that two different pointers always have disjoint contents and does not require copying of 
pointer arguments to local variables. If your compiler does not have this type of optimization, then 
copy the data pointed to by the pointer arguments to local variables at the start of the function and if 
necessary copy them back at the end of the function. 
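
A smaller illustration of the same copy-in, copy-out pattern, using hypothetical function names: the first function updates the caller's variable through the pointer on every iteration, while the second accumulates in a local variable and writes the result back once.

```c
#include <stddef.h>

/* The compiler cannot prove that total does not point into v, so it
   may have to reload and store *total on every iteration. */
void sum_into_deref(const long *v, size_t n, long *total)
{
    *total = 0;
    for (size_t i = 0; i < n; i++)
        *total += v[i];
}

/* The running sum lives in a local variable, which can stay in a
   register; the pointer is dereferenced only once, at the end. */
void sum_into_local(const long *v, size_t n, long *total)
{
    long t = 0;
    for (size_t i = 0; i < n; i++)
        t += v[i];
    *total = t;
}
```

The two functions are equivalent only when total does not alias an element of v, which is the same assumption the listings in this section make about q and r.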

Examples 

Listing 16. Avoid 

// Assumes pointers are different and q != r. 

void isqrt(unsigned long a, unsigned long *q, unsigned long *r) { 

*q = a; 
if (a > 0) { 

while (*q > (*r = a / *q) ) { 

*q = (*q + *r) >> 1; 

} 

} 

*r = a - *q * *q; 

} 

Listing 17. Preferred 

// Assumes pointers are different and q != r. 

void isqrt(unsigned long a, unsigned long *q, unsigned long *r) { 
    unsigned long qq, rr; 
    qq = a; 
    if (a > 0) { 
        while (qq > (rr = a / qq)) { 
            qq = (qq + rr) >> 1; 
        } 
    } 
    rr = a - qq * qq; 
    *q = qq; 
    *r = rr; 
} 

2.22 Array Indices 

Optimization 

The preferred type for array indices is ptrdiff_t. 

Application 

This optimization applies to: 

• 32-bit software 

• 64-bit software 

Rationale 

Array indices are often used with pointers while doing arithmetic. Using ptrdiff_t produces more 
portable code and will generally provide good performance. 
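
A brief sketch of an index of type ptrdiff_t, with a hypothetical function name. Because ptrdiff_t is signed and has the width of a pointer difference, the same loop works for indices relative to a pointer into the middle of an array, including negative ones.

```c
#include <stddef.h>

/* Sums base[first] through base[last - 1]. The indices may be
   negative when base points into the middle of an array. */
long sum_range(const long *base, ptrdiff_t first, ptrdiff_t last)
{
    long sum = 0;
    for (ptrdiff_t i = first; i < last; i++)
        sum += base[i];
    return sum;
}
```

For example, with a five-element array a, the call sum_range(a + 2, -2, 3) visits all five elements.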

2.23 32-Bit Integral Data Types 

Optimization 

Use 32-bit integers instead of integers with smaller sizes (16-bit or 8-bit). 

Application 

This optimization applies to 32-bit software. 

Rationale 

Be aware of the amount of storage associated with each integral data type. 

2.24 Sign of Integer Operands 

Optimization 

Where there is a choice of using either a signed or an unsigned type, take into consideration that some 
operations are faster with unsigned types while others are faster for signed types. 

Application 

This optimization applies to: 

• 32-bit software 

Rationale 

In many cases, the type of data to be stored in an integer variable determines whether a signed or an 
unsigned integer type is appropriate. For example, to record the weight of a person in pounds, no 
negative numbers are required, so an unsigned type is appropriate. However, recording temperatures 
in degrees Celsius may require both positive and negative numbers, so a signed type is needed. 

Integer-to-floating-point conversion using integers larger than 16 bits is faster with signed types, as 
the AMD64 architecture provides instructions for converting signed integers to floating-point but has 
no instructions for converting unsigned integers. In a typical case, a 32-bit integer is converted by a 
compiler to assembly as follows: 

Examples 

Listing 18. (Avoid) 

double x;          ====>   mov  [temp+4], 0 
unsigned int i;            mov  eax, i 
                           mov  [temp], eax 
x = i;                     fild QWORD PTR [temp] 
                           fstp QWORD PTR [x] 

The preceding code is slow not only because of the number of instructions, but also because a size 
mismatch prevents store-to-load forwarding to the FILD instruction. Instead, use the following code: 

Listing 19. (Preferred) 

double x;          ====>   fild DWORD PTR [i] 
int i;                     fstp QWORD PTR [x] 
x = i; 

Computing quotients and remainders in integer division by constants is faster when performed on 
unsigned types. The following typical case is the compiler output for a 32-bit integer divided by 4: 

Listing 20. Example 2 (Avoid) 

int i;             ====>   mov eax, i 
                           cdq 
i = i / 4;                 and edx, 3 
                           add eax, edx 
                           sar eax, 2 
                           mov i, eax 

Listing 21. Example 2 (Preferred) 

unsigned int i;    ====>   shr i, 2 
i = i / 4; 

In summary, use unsigned types for: 

• Division and remainders 

• Loop counters 

• Array indexing 

Use signed types for: 

• Integer-to-floating-point conversion 
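
The difference between Listings 20 and 21 comes from C's requirement that signed division truncate toward zero. The sketch below, with hypothetical function names, shows the two source forms side by side. For unsigned operands the quotient equals a plain right shift by 2; for negative signed operands it does not, which is why the compiler emits the extra fix-up instructions.

```c
/* Signed: -7 / 4 must yield -1 (truncation toward zero), so a single
   arithmetic shift is not sufficient for negative values. */
int quarter_signed(int i)
{
    return i / 4;
}

/* Unsigned: the quotient is exactly i >> 2, a single shift. */
unsigned quarter_unsigned(unsigned i)
{
    return i / 4;
}
```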

2.25 Accelerating Floating-Point Division and Square Root 

Optimization 

In applications that involve the heavy use of single precision division and square root operations, it is 
recommended that you port the code to SSE or 3DNow!™ inline assembly or use a compiler that can 
generate SSE or 3DNow! technology code. If neither of these methods is possible, the x87 FPU 
control word register precision control specification bits (PC) can be set to single precision to improve 
performance. (The processor defaults to double-extended precision. See AMD64 Architecture 
Programmer’s Manual Volume 1: Application Programming (order# 24592) for details on the FPU 
control register.) 

Application 

This optimization applies to 32-bit software. 

Rationale 

Division and square root have a much longer latency than other floating-point operations, even though 
the AMD Athlon 64 and AMD Opteron processors provide significant acceleration of these two 
operations. In some application programs, these operations occur so often as to seriously impact 
performance. If code has hot spots that use single precision arithmetic only (that is, all computation 
involves data of type float) and for some reason cannot be ported to 3DNow! code, the following 
technique may be used to improve performance. 

The x87 FPU has a precision-control field as part of the FPU control word. The precision-control 
setting determines rounding precision of instruction results and affects the basic arithmetic 
operations, including division and the extraction of square root. Division and square root on the 
AMD Athlon 64 and AMD Opteron processors are only computed to the number of bits necessary for 
the currently selected precision. Setting precision control to single precision (versus the Win32 
default of double precision) lowers the latency of those operations. 

The Microsoft® Visual C environment provides functions to manipulate the FPU control word and 
thus the precision control. Note that these functions are not very fast, so insert changes of precision 
control where it creates little overhead, such as outside a computation-intensive loop. Otherwise, the 
overhead created by the function calls outweighs the benefit from reducing the latencies of divide and 
square-root operations. For more information on this topic, see AMD64 Architecture Programmer's 
Manual Volume 1: Application Programming (order# 24592). 

The following example shows how to set the precision control to single precision and later restore the 
original settings in the Microsoft Visual C environment. 

Examples 

Listing 22. 

/* Prototype for _controlfp function */ 
#include <float.h> 

unsigned int orig_cw; 

/* Get current FPU control word and save it. */ 
orig_cw = _controlfp(0, 0); 

/* Set precision control in FPU control word to single precision. 
   This reduces the latency of divide and square-root operations. */ 
_controlfp(_PC_24, MCW_PC); 

/* Restore original FPU control word. */ 
_controlfp(orig_cw, 0xfffff); 

2.26 Fast Floating-Point-to-Integer Conversion 

Optimization 

Use the 3DNow! PF2ID instruction, which performs truncating conversion, to accomplish rapid 
floating-point-to-integer conversion if the floating-point operand is of type float. 

Application 

This optimization applies to 32-bit software. 

Rationale 

Floating-point-to-integer conversion in C programs is typically a very slow operation. The semantics 
of C and C++ demand that the conversion use truncation. If the floating-point operand is of type 
float, and the compiler supports 3DNow! code generation, then the 3DNow! PF2ID instruction, 
which performs truncating conversion, can be utilized by the compiler to accomplish rapid floating- 
point-to-integer conversion. 

Note: The PF2ID instruction does not provide conversion compliant with the IEEE-754 standard. 
Some operands of type float (IEEE-754 single precision) such as NaNs, infinities, and 
denormals, are either unsupported or not handled in compliance with the IEEE-754 standard 
by 3DNow! technology. 

For double precision operands, the usual way to accomplish truncating conversion involves the 
following algorithm: 

1. Save the current x87 rounding mode (this is usually round to nearest or even). 

2. Set the x87 rounding mode to truncation. 

3. Load the floating-point source operand and store the integer result. 

4. Restore the original x87 rounding mode. 

This algorithm is typically implemented through the C run-time library function ftol. While the 
AMD Athlon 64 and AMD Opteron processors have special hardware optimizations to speed up the 
changing of x87 rounding modes and therefore ftol, calls to ftol may still tend to be slow. 

For situations where very fast floating-point-to-integer conversion is required, the conversion code in 
Listing 24 on page 53 may be helpful. This code uses the current rounding mode instead of truncation 
when performing the conversion. Therefore, the result may differ by 1 from the ftol result. The 
replacement code adds the “magic number” 2^52 + 2^51 to the source operand, then stores the double 
precision result to memory and retrieves the lower doubleword of the stored result. Adding the magic 
number shifts the original argument to the right inside the double precision mantissa, placing the 
binary point of the sum immediately to the right of the least-significant mantissa bit. Extracting the 
lower doubleword of the sum then delivers the integral portion of the original argument. 

The following conversion code causes a 64-bit store to feed into a 32-bit load. The load is from the 
lower 32 bits of the 64-bit store, the one case of size mismatch between a store and a dependent load 
that is specifically supported by the store-to-load-forwarding hardware of the AMD Athlon 64 and 
AMD Opteron processors. 

Examples 

Listing 23. Slow 

double x; 
int i ; 

i = x; 

Listing 24. Fast 

#define DOUBLE2INT(i, d) \ 

{double t = ((d) + 6755399441055744.0); i = *((int *)(&t));} 

double x; 
int i ; 

DOUBLE2INT(i, x); 
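
The macro in Listing 24 reads a double through an int pointer. A function-style variant that copies the bytes with memcpy instead is sketched below; the function name is hypothetical, and the code assumes IEEE-754 doubles, a little-endian target such as x86, and an argument whose magnitude is below 2^31. The constant 6755399441055744.0 is 2^52 + 2^51.

```c
#include <stdint.h>
#include <string.h>

/* Converts d to an integer using the current rounding mode (round to
   nearest even by default), not truncation.
   Assumes IEEE-754 doubles, little-endian byte order, |d| < 2^31. */
int32_t double_to_int_fast(double d)
{
    /* volatile forces the sum to be stored as a 64-bit double. */
    volatile double t = d + 6755399441055744.0;  /* 2^52 + 2^51 */
    double stored = t;
    int32_t low;

    /* On a little-endian target the first four bytes are the lower
       doubleword of the sum, which holds the integer result. */
    memcpy(&low, &stored, sizeof low);
    return low;
}
```

Because the current rounding mode is used, 2.5 converts to 2 and 3.5 converts to 4 under the default mode, whereas a C cast would truncate both toward zero.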

2.27 Speeding Up Branches Based on Comparisons Between Floats 

Optimization 

Store operands of type float into a memory location and use integer comparison with the memory 
location to perform fast branches in cases where compilers do not support fast floating-point 
comparison instructions or 3DNow! code generation. 

Application 

This optimization applies to 32-bit software. 

Rationale 

Branches based on floating-point comparisons are often slow. The AMD Athlon 64 and 
AMD Opteron processors support the FCOMI, FUCOMI, FCOMIP, and FUCOMIP instructions that 
allow implementation of fast branches based on comparisons between operands of type double or 
type float. However, many compilers do not support generating these instructions. Likewise, 
floating-point comparisons between operands of type float can be accomplished quickly by using 
the 3DNow! PFCMP instruction if the compiler supports 3DNow! code generation. 

Many compilers only implement branches based on floating-point comparisons by using FCOM or 
FCOMP to compare the floating-point operands, followed by fstsw ax in order to transfer the x87 
condition-code flags into EAX. The subsequent branch is then based on the contents of the EAX 
register. Although the AMD Athlon 64 and AMD Opteron processors have acceleration hardware to 
speed up the FSTSW instruction, this process is still fairly slow. 

Branches Dependent on Integer Comparisons Are Fast 

One alternative for branches dependent upon the outcome of the comparison of operands of type 
float is to store the operand(s) into a memory location and then perform an integer comparison with 
that memory location. Branches dependent on integer comparisons are very fast. It should be noted 
that the replacement code uses a load dependent on an immediately prior store. If the store is not 
doubleword-aligned, no store-to-load-forwarding takes place, and the branch is still slow. Also, if 
there is a lot of activity in the load-store queue, forwarding of the store data may be somewhat 
delayed, thus negating some of the advantages of using the replacement code. It is recommended that 
you experiment with the replacement code to test whether it actually provides a performance increase 
in the code at hand. 

The replacement code works well for comparisons against zero, including correct behavior when 
encountering a negative zero as allowed by the IEEE-754 standard. It also works well for comparing 
to positive constants. In that case, the user must first determine the integer representation of that 
floating-point constant. This can be accomplished with the following C code snippet: 

float x; 

scanf("%g", &x); 
printf("%08X\n", (*((int *)(&x)))); 

The replacement code is IEEE-754 compliant for all classes of floating-point operands except NaNs. 
However, NaNs do not occur in properly working software. 

Examples 

Initial definitions: 

#define FLOAT2INTCAST(f) (*((int *)(&f))) 

#define FLOAT2UINTCAST(f) (*((unsigned int *)(&f))) 


Table 3: Comparisons against Zero 

Use this ...                                   Instead of this 
if (FLOAT2UINTCAST(f) > 0x80000000U)           if (f < 0.0f) 
if (FLOAT2INTCAST(f) <= 0)                     if (f <= 0.0f) 
if (FLOAT2INTCAST(f) > 0)                      if (f > 0.0f) 
if (FLOAT2UINTCAST(f) <= 0x80000000U)          if (f >= 0.0f) 

Table 4: Comparisons against Positive Constant 

Use this ...                                   Instead of this 
if (FLOAT2INTCAST(f) < 0x40400000)             if (f < 3.0f) 
if (FLOAT2INTCAST(f) <= 0x40400000)            if (f <= 3.0f) 
if (FLOAT2INTCAST(f) > 0x40400000)             if (f > 3.0f) 
if (FLOAT2INTCAST(f) >= 0x40400000)            if (f >= 3.0f) 

Table 5: Comparisons among Two Floats 

Use this ...                                   Instead of this 
float t = f1 - f2;                             if (f1 < f2) 
if (FLOAT2UINTCAST(t) > 0x80000000U) 

float t = f1 - f2;                             if (f1 <= f2) 
if (FLOAT2INTCAST(t) <= 0) 

float t = f1 - f2;                             if (f1 > f2) 
if (FLOAT2INTCAST(t) > 0) 

float t = f1 - f2;                             if (f1 >= f2) 
if (FLOAT2UINTCAST(t) <= 0x80000000U) 
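
As an illustration of the first row of Table 3 and the third row of Table 4, the sketch below reads the bit pattern of a float with memcpy rather than through a pointer cast. The function names are hypothetical, and the code assumes IEEE-754 single-precision floats and operands that are not NaNs, the same restriction stated above.

```c
#include <stdint.h>
#include <string.h>

/* Returns the bit pattern of f as a signed 32-bit integer. */
static int32_t float_bits(float f)
{
    int32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}

/* Equivalent to f < 0.0f for non-NaN f. Negative zero has the bit
   pattern 0x80000000 and therefore correctly compares as not less. */
int is_negative(float f)
{
    return (uint32_t)float_bits(f) > 0x80000000u;
}

/* Equivalent to f > 3.0f for non-NaN f. 0x40400000 is the
   single-precision bit pattern of 3.0f. */
int greater_than_three(float f)
{
    return float_bits(f) > 0x40400000;
}
```

Negative floats have the sign bit set, so their bit patterns are negative as signed integers and correctly compare as not greater than the positive constant.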




2.28 Improving Performance in Linux Libraries 

Optimization 

If interposition is not important to a particular application and you are using ld from the binutils 
package, you can use a linker option that binds references to public global routines inside the 
library so that they cannot be overridden. 

Application 

This optimization applies to: 

• 32-bit software 

• 64-bit software 

Rationale 

Dynamically loadable libraries are a versatile feature of the Linux operating system. They allow one 
or more symbols in one library to override the same symbol in another library. Known as 
interposition, this ability makes customizations and probing seamless. Interposition is implemented 
by means of a procedure linkage table (PLT). The PLT is so flexible that even references to an 
overridden symbol inside the library end up referencing the overriding symbol. However, the PLT 
imposes a performance penalty by requiring all function calls to public global routines to go through 
an extra step that increases the chances of cache misses and branch mispredictions. This is 
particularly severe for C++ classes whose methods typically refer to other methods in the same class. 

Examples 

When using ld, include the following command line option: 

-Bsymbolic 

If using gcc to build a library, add this option to the command-line: 

-Wl,-Bsymbolic 

Chapter 3 General 64-Bit Optimizations 


In long mode, the AMD64 architecture provides both a compatibility mode, which allows a 64-bit 
operating system to run existing 16-bit and 32-bit applications, and a 64-bit mode, which provides 
64-bit addressing and expanded register resources to support higher performance for recompiled 
64-bit programs. This chapter presents general optimizations that improve the performance of 
software designed to run in 64-bit mode. Therefore, all optimizations in this chapter apply only to 
64-bit software. 

This chapter covers the following topics: 


Topic                                                       Page 
64-Bit Registers and Integer Arithmetic                     60 
64-Bit Arithmetic and Large-Integer Multiplication          62 
128-Bit Media Instructions and Floating-Point Operations    67 
32-Bit Legacy GPRs and Small Unsigned Integers              68 

3.1 64-Bit Registers and Integer Arithmetic 

Optimization 

Use 64-bit registers for 64-bit integer arithmetic. 

Rationale 

Using 64-bit registers instead of their 32-bit equivalents can dramatically reduce the amount of code 
necessary to perform 64-bit integer arithmetic. 

Example 1 

This code performs 64-bit addition using 32-bit registers: 

; Add ECX:EBX to EDX:EAX, and place sum in EDX:EAX. 

00000000 03 C3 add eax, ebx 
00000002 13 D1 adc edx, ecx 

Using 64-bit registers, the previous code can be replaced by one simple instruction (assuming that 
RAX and RBX contain the 64-bit integer values to add): 

00000000 48 03 C3 add rax, rbx 

Although the preceding instruction requires one additional byte for the REX prefix, it is still one byte 
shorter than the original code. More importantly, this instruction still has a latency of only one cycle, 
uses two fewer registers, and occupies only one decode slot. 
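
The same contrast can be sketched in C. This is only an illustration, assuming a hypothetical `pair32` type that stands in for a register pair such as EDX:EAX; the function names are not part of this guide. The first function propagates the carry by hand, as ADD followed by ADC does, and the second relies on a native 64-bit addition.

```c
#include <assert.h>
#include <stdint.h>

/* A 64-bit value held as two 32-bit halves, like the EDX:EAX register pair.
   (Hypothetical type used only for this sketch.) */
typedef struct {
    uint32_t lo;
    uint32_t hi;
} pair32;

/* Mirrors ADD followed by ADC: add the low halves, then add the high
   halves plus the carry out of the low addition. */
pair32 add64_with_32bit_halves(pair32 x, pair32 y)
{
    pair32 sum;
    uint32_t carry;

    sum.lo = x.lo + y.lo;            /* add eax, ebx */
    carry = (sum.lo < x.lo);         /* carry flag produced by the low add */
    sum.hi = x.hi + y.hi + carry;    /* adc edx, ecx */
    return sum;
}

/* Mirrors the single 64-bit ADD. */
uint64_t add64_native(uint64_t x, uint64_t y)
{
    return x + y;                    /* add rax, rbx */
}

/* Both forms agree on an input whose low halves overflow. */
void check_add64(void)
{
    pair32 x = { 0xFFFFFFFFu, 0x1u };    /* 0x00000001FFFFFFFF */
    pair32 y = { 0x00000001u, 0x2u };    /* 0x0000000200000001 */
    pair32 s = add64_with_32bit_halves(x, y);

    assert(s.lo == 0x0u && s.hi == 0x4u);
    assert(add64_native(0x1FFFFFFFFull, 0x200000001ull) == 0x400000000ull);
}
```

When such code is compiled for 64-bit mode, a compiler would normally be expected to emit the single-instruction form for `add64_native` without any source-level effort.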

Example 2 

To compute the low-order half of the product of two 64-bit integers using 32-bit registers, a procedure 
such as the following is necessary: 

; In:       [ESP+8]:[ESP+4]   = multiplicand 
;           [ESP+16]:[ESP+12] = multiplier 
; Out:      EDX:EAX = (multiplicand * multiplier) % 2^64 
; Destroys: EAX, ECX, EDX, EFlags 

llmul PROC 
   mov  edx, [esp+8]         ; multiplicand_hi 
   mov  ecx, [esp+16]        ; multiplier_hi 
   or   edx, ecx             ; One operand >= 2^32? 
   mov  edx, [esp+12]        ; multiplier_lo 
   mov  eax, [esp+4]         ; multiplicand_lo 
   jnz  twomul               ; Yes, need two multiplies. 
   mul  edx                  ; multiplicand_lo * multiplier_lo 
   ret                       ; Done, return to caller. 

twomul: 
   imul edx, [esp+8]         ; p3_lo = multiplicand_hi * multiplier_lo 
   imul ecx, eax             ; p2_lo = multiplier_hi * multiplicand_lo 
   add  ecx, edx             ; p2_lo + p3_lo 
   mul  dword ptr [esp+12]   ; p1 = multiplicand_lo * multiplier_lo 
   add  edx, ecx             ; p1 + p2_lo + p3_lo = result in EDX:EAX 
   ret                       ; Done, return to caller. 

llmul ENDP 

Using 64-bit registers, the entire product can be produced with only one instruction: 

; Multiply RAX by RBX. The 128-bit product is stored in RDX:RAX. 
00000000 48 F7 EB imul rbx 


Related Information 

For more examples of 64-bit arithmetic using only 32-bit registers, see “Efficient 64-Bit Integer 
Arithmetic in 32-Bit Mode” on page 170. 

3.2 64-Bit Arithmetic and Large-Integer Multiplication 

Optimization 

Use 64-bit arithmetic for integer multiplication that produces 128-bit or larger products. 

Background 

Large-number multiplications (those involving 128-bit or larger products) are utilized in 
cryptography algorithms, which figure importantly in e-commerce applications and secure 
transactions on the Internet. Processors cannot perform large-number multiplication natively; they 
must break the operation into chunks that are permitted by their architecture (32-bit or 64-bit 
additions and multiplications). 

Rationale 

Using 64-bit rather than 32-bit integer operations dramatically reduces the number of additions and 
multiplications required to compute large products. For example, computing a 1024-bit product using 
64-bit arithmetic requires fewer than one quarter the number of instructions required when using 
32-bit operations: 


Comparing...                      32-bit arithmetic    64-bit arithmetic 
Number of multiplications         256                  64 
Number of additions with carry    509                  125 
Number of additions               255                  63 


In addition, the processor performs 64-bit additions just as fast as it performs 32-bit additions, and the 
latency of 64-bit multiplications is only slightly higher than for 32-bit multiplications. (The processor 
is capable of performing a 64-bit addition each clock cycle and a 64-bit multiplication every other 
clock cycle.) 

Example 

Consider the multiplication of two unsigned 64-bit numbers a and b, represented in terms of 32-bit 
numbers a1:a0 and b1:b0. 

a = a1 * 2^32 + a0 
b = b1 * 2^32 + b0 

The product of a and b, c, can be expressed in terms of products of the 32-bit components, as follows: 

c = (a1 * b1) * 2^64 + (a1 * b0 + a0 * b1) * 2^32 + (a0 * b0) 

Each of the products of the components of a and b (for example, a1 * b1) is composed of 64 bits: an 
upper 32 bits and a lower 32 bits. It is convenient to represent these individual products as d, e, f, 
and g, as follows: 

a0 * b0 = d1:d0 = d1 * 2^32 + d0 
a1 * b0 = e1:e0 = e1 * 2^32 + e0 
a0 * b1 = f1:f0 = f1 * 2^32 + f0 
a1 * b1 = g1:g0 = g1 * 2^32 + g0 

Substitution yields the following equation: 

c = (g1 * 2^32 + g0) * 2^64 + (e1 * 2^32 + e0 + f1 * 2^32 + f0) * 2^32 + (d1 * 2^32 + d0) 

Simplifying yields this equation: 

c = g1 * 2^96 + (e1 + f1 + g0) * 2^64 + (d1 + e0 + f0) * 2^32 + d0 

It is convenient to represent the terms that are multiplied by each power of 2 as c3, c2, c1, and c0, as 
follows: 

g1 = c3 
e1 + f1 + g0 = c2 
d1 + e0 + f0 = c1 
d0 = c0 

Substituting again yields: 

c = c3 * 2^96 + c2 * 2^64 + c1 * 2^32 + c0 
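
The decomposition can be checked with a short C sketch. It is an illustration only: the function name and the array layout (least-significant 32-bit word first) are assumptions made here. Unlike the equations, the code must also carry any overflow of the c1 and c2 sums into the next word, because each of those sums can exceed 32 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Multiply a = a[1]:a[0] by b = b[1]:b[0] using only 32 x 32 -> 64-bit
   products, and store the 128-bit result in c[0..3] (c[0] = c0). */
void mul64x64_32bit(const uint32_t a[2], const uint32_t b[2], uint32_t c[4])
{
    uint64_t d = (uint64_t)a[0] * b[0];    /* d1:d0 */
    uint64_t e = (uint64_t)a[1] * b[0];    /* e1:e0 */
    uint64_t f = (uint64_t)a[0] * b[1];    /* f1:f0 */
    uint64_t g = (uint64_t)a[1] * b[1];    /* g1:g0 */
    uint64_t t;

    c[0] = (uint32_t)d;                    /* c0 = d0 */

    /* c1 = d1 + e0 + f0; keep the overflow in the upper half of t. */
    t = (d >> 32) + (uint32_t)e + (uint32_t)f;
    c[1] = (uint32_t)t;

    /* c2 = e1 + f1 + g0 + carry out of c1. */
    t = (t >> 32) + (e >> 32) + (f >> 32) + (uint32_t)g;
    c[2] = (uint32_t)t;

    /* c3 = g1 + carry out of c2. */
    c[3] = (uint32_t)((t >> 32) + (g >> 32));
}

/* (2^64 - 1)^2 = 0xFFFFFFFFFFFFFFFE0000000000000001 */
void check_mul64x64(void)
{
    const uint32_t a[2] = { 0xFFFFFFFFu, 0xFFFFFFFFu };
    uint32_t c[4];

    mul64x64_32bit(a, a, c);
    assert(c[0] == 0x00000001u && c[1] == 0x00000000u);
    assert(c[2] == 0xFFFFFFFEu && c[3] == 0xFFFFFFFFu);
}
```

The four products and the carry handling correspond to the MUL, ADD, and ADC instructions of a 32-bit implementation.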

The following procedure performs 64-bit unsigned integer multiplication, as previously illustrated, 
using 32-bit integer operations: 

; 32bitalu_64x64(int *a, int *b, int *c); 
; 
; TO ASSEMBLE INTO *.obj DO THE FOLLOWING: 
;    ml.exe -coff -c 32bitalu_64x64.asm 

.586 
.K3D 
.XMM 

_DATA SEGMENT 
tempESP dd 0 
_DATA ENDS 

_TEXT SEGMENT 
ASSUME DS:_DATA 
PUBLIC _32bitalu_64x64 
_32bitalu_64x64 PROC NEAR 

; Save the register state. Registers EAX, ECX, and EDX are considered volatile 
; and assumed to be changed, while the registers below must be preserved. 
   push ebp 
   mov  ebp, esp 

; Parameters passed into routine: 
;    [ebp+8]  = ->a 
;    [ebp+12] = ->b 
;    [ebp+16] = ->c 
   push ebx 
   push esi 
   push edi 

   mov  esi, [ebp+8]      ; ESI = ->a 
   mov  edi, [ebp+12]     ; EDI = ->b 
   mov  ecx, [ebp+16]     ; ECX = ->c 
   push ebp 
   mov  [tempESP], esp 

; Multiply 64-bit numbers a and b, each of which is composed of two 32-bit 
; components: 
;    a = a1 * 2^32 + a0 
;    b = b1 * 2^32 + b0 
   mov  eax, [esi]        ; EAX = a0 
   mov  edx, [edi]        ; EDX = b0 
   mul  edx               ; EDX:EAX = a0*b0 = d1:d0 
   mov  ebx, edx          ; EBX = d1 
   mov  [ecx], eax        ; c0 = EAX 
   xor  esp, esp          ; ESP = 0 
   xor  ebp, ebp          ; EBP = 0 

   mov  eax, [esi+4]      ; EAX = a1 
   mov  edx, [edi]        ; EDX = b0 
   mul  edx               ; EDX:EAX = a1*b0 = e1:e0 
   add  ebx, eax          ; EBX = d1 + e0 
   adc  ebp, edx          ; EBP = e1 + possible carry from d1 + e0 
   adc  esp, 0            ; Collect possible carry into c3. 

   mov  eax, [esi]        ; EAX = a0 
   mov  edx, [edi+4]      ; EDX = b1 
   mul  edx               ; EDX:EAX = a0*b1 = f1:f0 
   add  ebx, eax          ; EBX = d1 + e0 + f0 
   adc  ebp, edx          ; EBP = e1 + f1 + carry 
   adc  esp, 0            ; Collect possible carry into c3. 
   mov  [ecx+4], ebx      ; c1 = d1 + e0 + f0 

   mov  eax, [esi+4]      ; EAX = a1 
   mov  edx, [edi+4]      ; EDX = b1 
   mul  edx               ; EDX:EAX = a1*b1 = g1:g0 
   add  ebp, eax          ; EBP = e1 + f1 + g0 + carry 
   adc  esp, edx          ; ESP = g1 + carry 
   mov  [ecx+8], ebp      ; c2 = e1 + f1 + g0 + carry 
   mov  [ecx+12], esp     ; c3 = g1 + carry 

; Restore the register state. 
   mov  esp, [tempESP] 
   pop  ebp 
   pop  edi 
   pop  esi 
   pop  ebx 
   mov  esp, ebp 
   pop  ebp 

   ret 

_32bitalu_64x64 ENDP 

_TEXT ENDS 

END 

To improve performance and substantially reduce code size, the following procedure performs the 
same 64-bit integer multiplication using 64-bit instead of 32-bit operations: 

; 64bitalu_64x64(int *a, int *b, int *c); 
; 
; TO ASSEMBLE INTO *.obj DO THE FOLLOWING: 
;    ml64.exe -c 64bitalu_64x64.asm 

_TEXT SEGMENT 
64bitalu_64x64 PROC NEAR 

; Parameters passed into routine: 
;    rcx = ->a 
;    rdx = ->b 
;    r8  = ->c 

   mov  rax, [rcx]             ; RAX = [a0] 
   mul  QWORD PTR [rdx]        ; Multiply [a0] by [b0] such that 
                               ; RDX:RAX = [c1]:[c0]. 
   mov  [r8], rax              ; Store 128-bit product of a and b. 
   mov  [r8+8], rdx 

   ret 

64bitalu_64x64 ENDP 
_TEXT ENDS 
END 

3.3 128-Bit Media Instructions and Floating-Point Operations 

Optimization 

Use 128-bit media (SSE and SSE2) instructions instead of x87 or 64-bit media (MMX™ and 
3DNow!™ technology) instructions for floating-point operations. 

Rationale 

In 64-bit mode, the processor provides eight additional XMM registers (XMM8-XMM15) for a total 
of 16. These extra registers can substantially reduce register pressure in floating-point code written 
using 128-bit media instructions. 

Although the processor fully supports the x87 and 64-bit media instructions, there are only eight 
registers available to these instructions (ST(0)-ST(7) or MMX0-MMX7, respectively). 

3.4 32-Bit Legacy GPRs and Small Unsigned Integers 

Optimization 

Use the 32-bit legacy general-purpose registers (EAX through ESI) instead of their 64-bit extensions 
to store unsigned integer values whose range never requires more than 32 bits, even if subsequent 
statements use the 32-bit value in a 64-bit operation. (For example, use ECX instead of RCX until you 
need to perform a 64-bit operation; then use RCX.) 

Rationale 

In 64-bit mode, the machine-language representation of many instructions that operate on 64-bit 
register operands requires a REX prefix byte, which increases the size of the code. However, 
instructions that operate on a 32-bit legacy register operand do not require the prefix and have the 
desirable side-effect of clearing the upper 32 bits of the extended register to zero. For example, using 
the AND instruction on ECX clears the upper half of RCX. 

Caution 

Because the assembler also uses a REX prefix byte to encode the 32-bit sizes of the eight new 64-bit 
general-purpose registers (R8D-R15D), you should only use one of the original eight general- 
purpose registers (EAX through ESI) to implement this technique. 

Example 

The following example illustrates the unnecessary use of 64-bit registers to calculate the number of 
bytes remaining to be copied by an aligned block-copy routine after copying the first few bytes having 
addresses not meeting the routine’s 8-byte-alignment requirements. The first two statements, after the 
program comments, use the 64-bit R10 register (presumably because this value is later used to 
adjust a 64-bit value in R8), even though the range of values stored in R10 takes no more than four 
bits to represent. Using R10 instead of a smaller register requires a REX prefix byte (in this case, 49), 
which increases the size of the machine-language code. 

; Input: 
;    R10 = source address (src) 
;    R8  = number of bytes to copy (count) 

49 F7 DA       neg r10        ; Subtract the source address from 2^64. 
49 83 E2 07    and r10, 7     ; Determine how many bytes were copied separately. 
4D 2B C2       sub r8, r10    ; Subtract the number of bytes already copied from 
                              ; the number of bytes to copy. 

To improve code density, the following rewritten code uses ECX until it is absolutely necessary to use 
RCX, eliminating two REX prefix bytes: 

F7 D9          neg ecx        ; Subtract the source address from 2^32 (the processor 
                              ; clears the high 32 bits of RCX). 
83 E1 07       and ecx, 7     ; Determine how many bytes were copied separately. 
4C 2B C1       sub r8, rcx    ; Subtract the number of bytes already copied from 
                              ; the number of bytes to copy. 
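
The reason the 32-bit form is safe here can be sketched in C: the low three bits of the negated address do not depend on the upper 32 bits, and a 32-bit result is zero-extended before it takes part in the 64-bit subtraction. The function names and the sample address below are hypothetical and serve only to illustrate the point.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes copied one at a time before src reaches an 8-byte boundary,
   computed with 64-bit arithmetic (NEG and AND on a 64-bit register). */
uint64_t head_bytes_64(uint64_t src)
{
    return (0 - src) & 7;
}

/* The same count computed on the low 32 bits only (NEG and AND on ECX).
   Converting the 32-bit result to uint64_t models the implicit
   zero-extension into RCX. */
uint64_t head_bytes_32(uint64_t src)
{
    uint32_t low = (uint32_t)src;

    return (uint64_t)((0u - low) & 7u);
}

/* Bytes left to copy after the unaligned head (SUB on the 64-bit count). */
uint64_t remaining_after_head(uint64_t src, uint64_t count)
{
    return count - head_bytes_32(src);
}

void check_head_bytes(void)
{
    uint64_t src = 0x00007FFF12345675ull;    /* hypothetical address */

    assert(head_bytes_64(src) == 3);
    assert(head_bytes_32(src) == head_bytes_64(src));
    assert(remaining_after_head(src, 100) == 97);
}
```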

Chapter 4 Instruction-Decoding Optimizations 


The optimizations in this chapter are designed to help maximize the number of instructions that the 
processor can decode at one time. 

The instruction fetcher of both the AMD Athlon™ 64 and AMD Opteron™ processors reads 16-byte 
packets from the L1 instruction cache. These packets are 16-byte aligned. The instruction bytes are 
then merged into a 32-byte pick window. On each cycle, the in-order front-end engine selects up to 
three AMD64 instructions for decode from the pick window. 

This chapter covers the following topics: 


Topic                                                                    Page 
DirectPath Instructions                                                  72 
Load-Execute Instructions                                                73 
Load-Execute Integer Instructions                                        73 
Load-Execute Floating-Point Instructions with Floating-Point Operands    74 
Load-Execute Floating-Point Instructions with Integer Operands           74 
Branch Targets in Program Hot Spots                                      76 
32/64-Bit vs. 16-Bit Forms of the LEA Instruction                        77 
Short Instruction Encodings                                              80 
Partial-Register Reads and Writes                                        81 
Using LEAVE for Function Epilogues                                       83 
Alternatives to SHLD Instruction                                         85 
8-Bit Sign-Extended Immediate Values                                     87 
8-Bit Sign-Extended Displacements                                        88 
Code Padding with Operand-Size Override and NOP                          89 

4.1 DirectPath Instructions 

Optimization 

Use DirectPath instructions rather than VectorPath instructions. (To determine the type of an 
instruction—either DirectPath or VectorPath—see Appendix C, “Instruction Latencies.”) 

Application 

This optimization applies to: 

• 32-bit software 

• 64-bit software 

Rationale 

DirectPath instructions minimize the number of operations per AMD64 instruction, thus providing 
for optimally efficient decode and execution. Up to three DirectPath Single instructions, or one and a 
half DirectPath Double instructions, can be decoded per cycle. VectorPath instructions block the 
decoding of DirectPath instructions. 

The AMD Athlon 64 and AMD Opteron processors implement the majority of instructions used by a 
compiler as DirectPath Single and DirectPath Double instructions. However, assembly writers must 
still take into consideration the use of DirectPath versus VectorPath instructions. 

4.2 Load-Execute Instructions 

A load-execute instruction is an instruction that loads a value from memory into a register and then 
performs an operation on that value. Many general purpose instructions, such as ADD, SUB, AND, 
etc., have load-execute forms: 

add rax, QWORD PTR [foo] 

This instruction loads the value foo from memory and then adds it to the value in the RAX register. 

The work performed by a load-execute instruction can also be accomplished by using two discrete 
instructions—a load instruction followed by an execute instruction. The following example employs 
discrete load and execute stages: 

mov rbx, QWORD PTR [foo] 
add rax, rbx 

The first statement loads the value foo from memory into the RBX register. The second statement 
adds the value in RBX to the value in RAX. 

The following optimizations govern the use of load-execute instructions: 

• Load-Execute Integer Instructions on page 73. 

• Load-Execute Floating-Point Instructions with Floating-Point Operands on page 74. 

• Load-Execute Floating-Point Instructions with Integer Operands on page 74. 

4.2.1 Load-Execute Integer Instructions 

Optimization 

When performing integer computations, use load-execute instructions instead of discrete load 
and execute instructions. Use discrete load and execute instructions only to avoid scheduler stalls for 
longer executing instructions and to explicitly schedule load and execute operations. 

Application 

This optimization applies to: 

• 32-bit software 

• 64-bit software 

Rationale 

Most load-execute integer instructions are DirectPath decodable and can be decoded at the rate of 
three per cycle. Splitting a load-execute integer instruction into two separate instructions reduces 
decoding bandwidth and increases register pressure, which results in lower performance. 

4.2.2 Load-Execute Floating-Point Instructions with Floating-Point Operands 

Optimization 

When performing floating-point computations using floating-point (not integer) source operands, 
use load-execute instructions instead of discrete load and execute instructions. 

Application 

This optimization applies to: 

• 32-bit software 

• 64-bit software 

Rationale 

Using load-execute floating-point instructions that take floating-point operands improves 
performance for the following reasons: 

• Denser code allows more work to be held in the instruction cache. 

• Denser code generates fewer internal macro-ops, allowing the floating-point scheduler to hold 
more work, which increases the chances of extracting parallelism from the code. 

Example 

Avoid code like this, which uses discrete load and execute instructions: 

movss xmm0, [float_var1] 
movss xmm12, [float_var2] 
mulss xmm0, xmm12 

Instead, use code like this, which uses a load-execute floating-point instruction: 

movss xmm0, [float_var1] 
mulss xmm0, [float_var2] 

4.2.3 Load-Execute Floating-Point Instructions with Integer Operands 

Optimization 

Avoid x87 load-execute floating-point instructions that take integer operands (FIADD, FICOM, 
FICOMP, FIDIV, FIDIVR, FIMUL, FISUB, and FISUBR). When performing floating-point 
computations using integer source operands, use discrete load (FILD) and execute instructions 
instead. 

Application 

This optimization applies to: 

• 32-bit software 

• 64-bit software 

Rationale 

The load-execute floating-point instructions that take integer operands are VectorPath instructions and 
generate two micro-ops in a cycle, while discrete load and execute instructions enable a third 
DirectPath instruction to be decoded in the same cycle. In some situations, this optimization can 
also reduce execution time if FILD can be scheduled several instructions ahead of the arithmetic 
instruction in order to cover the FILD latency. 

Example 

Avoid code such as the following, which uses load-execute floating-point instructions that take 
integer operands: 

fld   QWORD PTR [foo]    ; Push foo onto FP stack [ST(0) = foo]. 
fimul DWORD PTR [bar]    ; Multiply bar by ST(0) [ST(0) = bar * foo]. 
fiadd DWORD PTR [baz]    ; Add baz to ST(0) [ST(0) = baz + (bar * foo)]. 

Instead, use code such as the following, which uses discrete load and execute instructions: 

fild  DWORD PTR [bar]    ; Push bar onto FP stack. 
fild  DWORD PTR [baz]    ; Push baz onto FP stack. 
fld   QWORD PTR [foo]    ; Push foo onto FP stack. 
fmulp st(2), st          ; Multiply and pop [ST(1) = foo * bar, ST(0) = baz]. 
faddp st(1), st          ; Add and pop [ST(0) = baz + (foo * bar)]. 

4.3 Branch Targets in Program Hot Spots 

Optimization 

In program “hot spots” (as determined by either profiling or loop-nesting analysis), branch targets 
should be placed at or near the beginning of code windows that are 16-byte aligned. The smaller the 
basic block, the more beneficial this optimization will be. 

Application 

This optimization applies to: 

• 32-bit software 

• 64-bit software 

Rationale 

Aligning branch targets maximizes the number of instructions in the pick window and preserves 
instruction-cache space in branch-intensive code outside such hot spots. 

4.4 32/64-Bit vs. 16-Bit Forms of the LEA Instruction 

Optimization 

Use the 32-bit or 64-bit forms of the Load Effective Address (LEA) instruction rather than the 16-bit 
form. 

Application 

This optimization applies to: 

• 32-bit software 

• 64-bit software 

Rationale 

The 32-bit and 64-bit LEA instructions are implemented as DirectPath operations with an execution 
latency of only two cycles. The 16-bit LEA instruction, however, is a VectorPath instruction, which 
lowers the decode bandwidth and has a longer execution latency. 

4.5 Take Advantage of x86 and AMD64 Complex Addressing Modes 

Optimization 

When porting from other architectures, or if you are new to x86 assembly language, remember that 
the x86 architecture provides many complex addressing modes. By building the effective address in 
one instruction, the instruction count can sometimes be reduced, leading to better code density and 
greater decode bandwidth. Refer to the section on effective addresses in the 
AMD64 Architecture Programmer's Manual Volume 1: Application Programming for more detailed 
information on how effective addresses are formed. 

Application 

This optimization applies to: 

• 32-bit software 

• 64-bit software 

Rationale 

Building an effective address can appear to require several instructions when it combines a base 
address (such as the base of an array), an index, and perhaps a displacement. However, the x86 
architecture can often handle all of this in one instruction, which reduces code size and leaves 
fewer instructions to decode. As always, pay attention to total instruction length, latencies, 
and whether the instruction choices are DirectPath (fastest) or VectorPath (slower). 

Example 

The following sequence of five instructions, with a total latency of 8, can be replaced by one 
instruction. 

Number of Bytes    Latency    Instruction 
3                  1          movl %r10d,%r11d 
8                  2          leaq 0x68E35,%rcx 
3                  1          addq %rcx,%r11 
5                  3          movb (%r11,%r13),%cl 
2                  1          cmpb %al,%cl 

The following instruction replaces the functionality of the above sequence. 

Number of Bytes    Latency    Instruction 
8                  4          cmpb %al,0x68e35(%r10,%r13) 


Example 

These two instructions can be replaced by one instruction. 

movq $0x4c65a,%r11 
movq (%r11,%r8,8),%r11 

becomes: 

movq 0x4c65a(,%r8,8),%r11 

4.6 Short Instruction Encodings 

Optimization 

Use instruction forms with shorter encodings rather than those with longer encodings. For example, 
use 8-bit displacements instead of 32-bit displacements, and use the single-byte form of simple 
integer instructions instead of the 2-byte opcode-ModRM form. 

Application 

This optimization applies to: 

• 32-bit software 

• 64-bit software 

Rationale 

Using shorter instructions increases the number of instructions that can fit into the L1 instruction 
cache and increases the average decode rate. 

Example 

Avoid the use of instructions with longer encodings, such as those shown here: 

81 C0 78 56 34 12    add eax, 12345678h    ; 2-byte opcode form (with ModRM) 
81 C3 FB FF FF FF    add ebx, -5           ; 32-bit immediate value 
0F 84 05 00 00 00    jz label1             ; 2-byte opcode, 32-bit displacement 

Instead, choose instructions with shorter encodings, like these: 

05 78 56 34 12       add eax, 12345678h    ; 1-byte opcode form 
83 C3 FB             add ebx, -5           ; 8-bit sign-extended immediate value 
74 05                jz label1             ; 1-byte opcode, 8-bit displacement 

4.7 Partial-Register Reads and Writes 

Optimization 

Avoid partial register reads and writes. 

Application 

This optimization applies to: 

• 32-bit software 

• 64-bit software 


Rationale 

In order to handle partial register writes, the processor’s execution core implements a data merging 
scheme. 

In the execution unit, an instruction that writes part of a register merges the modified portion with the 
current state of the other part of the register. Therefore, the dependency hardware can potentially 
force a false dependency on the most recent instruction that writes to any part of the register. 

In addition, an instruction that has a read dependency on any part of a given architectural register has 
a read dependency on the most recent instruction that modifies any part of the same architectural 
register. 


Example 1 

Avoid code such as the following, which writes to only part of a register: 


mov al, 10 ; Instruction 1 

mov ah, 12 ; Instruction 2 has a false dependency on instruction 1. 

; Instruction 2 merges new AH with current EAX register 
; value forwarded by instruction 1. 

Example 2 

Avoid code such as the following, which both reads and writes only parts of registers: 


mov bx, 12h    ; Instruction 1 
mov bl, dl     ; Instruction 2 has a false dependency on the completion 
               ; of instruction 1. 
mov bh, cl     ; Instruction 3 has a false dependency on the completion 
               ; of instruction 2. 
mov al, bl     ; Instruction 4 depends on the completion of instruction 2. 

Example 3 

Avoid: 

mov al, bl 

Preferred: 

movzx eax, bl 

Example 4 

Avoid: 

mov al, [ebx] 

Preferred: 

movzx eax, byte ptr [ebx] 

Example 5 

Avoid: 

mov al, 01h 

Preferred: 

mov eax, 00000001h 

Example 6 

Avoid: 

movss xmm1, xmm2 

Preferred: 

movaps xmm1, xmm2 

4.8 Using LEAVE for Function Epilogues 

Optimization 

The recommended optimization for function epilogues depends on whether the function allocates 
local variables. 

If the function                      Then 
Allocates local variables            Replace the traditional function epilogue with the LEAVE 
                                     instruction. 
Does not allocate local variables    Do not use function prologues or epilogues. Access function 
                                     arguments and local variables through rSP. 

Application 

This optimization applies to: 

• 32-bit software 

• 64-bit software 

Rationale 

Functions That Allocate Local Variables 

The LEAVE instruction is a single-byte instruction and saves 2 bytes of code space over the 
traditional epilogue. Replacing the traditional sequence with LEAVE also preserves decode 
bandwidth. 

Functions That Do Not Allocate Local Variables 

Accessing function arguments and local variables directly through ESP frees EBP for use as a 
general-purpose register. 

Background 

The function arguments and local variables inside a function are referenced through a so-called frame 
pointer. In AMD64 code, the base pointer register (rBP) is customarily used as a frame pointer. You 
set up the frame pointer at the beginning of the function using a function prologue: 

push ebp         ; Save old frame pointer. 
mov  ebp, esp    ; Initialize new frame pointer. 
sub  esp, n      ; Allocate space for local variables (only if the 
                 ; function allocates local variables). 

Function arguments on the stack can now be accessed at positive offsets relative to rBP, and local 
variables are accessible at negative offsets relative to rBP. 

Example 

The traditional function epilogue looks like this: 

mov esp, ebp ; Deallocate local variables (only if space was allocated). 
pop ebp ; Restore old frame pointer. 

Replace the traditional function epilogue with a single LEAVE instruction: 

leave 

4.9 Alternatives to SHLD Instruction 

Optimization 

Where register pressure is low, replace the SHLD instruction with alternative code using ADD and 
ADC, or SHR and LEA. 

Application 

This optimization applies to: 

• 32-bit software 

• 64-bit software 

Rationale 

Using alternative code in place of SHLD achieves lower overall latency and requires fewer execution 
resources. The 32-bit and 64-bit forms of ADD, ADC, SHR, and LEA are DirectPath instructions, 
while SHLD is a VectorPath instruction. Use of the replacement code optimizes decode bandwidth 
because it potentially enables the simultaneous decoding of a third DirectPath instruction. However, 
the replacement code may increase register pressure because it destroys the contents of one register 
(reg2 in the following examples) whereas the register is preserved by SHLD. 

Example 1 

Replace this instruction: 

shld reg1, reg2, 1 

with this code sequence: 

add reg2, reg2 
adc reg1, reg1 

Example 2 

Replace this instruction: 

shld reg1, reg2, 2 

with this code sequence: 

shr reg2, 30 
lea reg1, [reg1*4+reg2] 

Example 3 

Replace this instruction: 

shld reg1, reg2, 3 

with this code sequence: 

shr reg2, 29 
lea reg1, [reg1*8+reg2] 
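
The equivalence behind these replacements can be sketched in C for 32-bit registers. This is an illustration with hypothetical function names, not code from this guide: the first function models what SHLD computes, and the other two model the replacement sequences. The addition performed by LEA matches the OR in the SHLD model because the shifted reg1 has its low bits clear.

```c
#include <assert.h>
#include <stdint.h>

/* Model of SHLD reg1, reg2, n on 32-bit registers for 0 < n < 32:
   shift reg1 left by n and fill the vacated low bits with the n
   high-order bits of reg2. */
uint32_t shld32(uint32_t reg1, uint32_t reg2, unsigned n)
{
    assert(n > 0 && n < 32);
    return (reg1 << n) | (reg2 >> (32 - n));
}

/* Example 1: ADD reg2, reg2 followed by ADC reg1, reg1 (shift by 1). */
uint32_t shld1_add_adc(uint32_t reg1, uint32_t reg2)
{
    uint32_t carry = reg2 >> 31;       /* carry out of add reg2, reg2 */

    return reg1 + reg1 + carry;        /* adc reg1, reg1 */
}

/* Examples 2 and 3: SHR reg2, 32-n followed by LEA with scale 4 or 8. */
uint32_t shld_shr_lea(uint32_t reg1, uint32_t reg2, unsigned n)
{
    assert(n == 2 || n == 3);
    reg2 >>= 32 - n;                   /* shr reg2, 30 or shr reg2, 29 */
    return reg1 * (1u << n) + reg2;    /* lea reg1, [reg1*scale+reg2] */
}

void check_shld(void)
{
    uint32_t reg1 = 0x80000001u;
    uint32_t reg2 = 0xC0000000u;

    assert(shld1_add_adc(reg1, reg2) == shld32(reg1, reg2, 1));
    assert(shld_shr_lea(reg1, reg2, 2) == shld32(reg1, reg2, 2));
    assert(shld_shr_lea(reg1, reg2, 3) == shld32(reg1, reg2, 3));
}
```

Both replacement functions overwrite their local copy of reg2, which mirrors the register-pressure cost noted in the rationale.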

4.10 8-Bit Sign-Extended Immediate Values 

Optimization 

Use 8-bit sign-extended immediate values instead of larger-size values. 

Application 

This optimization applies to: 

• 32-bit software 

• 64-bit software 

Rationale 

Using 8-bit sign-extended immediate values improves code density with no negative effects on the 
processor. 

Example 

Consider this instruction: 

add bx, -5 

Avoid encoding it as: 

81 C3 FB FF 

Instead, encode it as: 

83 C3 FB 
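
An assembler or code generator can apply this rule with a simple range test. The C sketch below is illustrative only; the function names are hypothetical, and it covers just the 16-bit and 32-bit operand sizes used in the examples.

```c
#include <assert.h>
#include <stdint.h>

/* Nonzero if value can be encoded as an 8-bit sign-extended immediate. */
int fits_in_imm8(int32_t value)
{
    return value >= -128 && value <= 127;
}

/* Size in bytes of the immediate field of an instruction such as ADD
   whose operand size is operand_bytes (2 or 4). */
unsigned imm_field_bytes(int32_t value, unsigned operand_bytes)
{
    assert(operand_bytes == 2 || operand_bytes == 4);
    return fits_in_imm8(value) ? 1u : operand_bytes;
}

void check_imm8(void)
{
    assert(imm_field_bytes(-5, 2) == 1);           /* add bx, -5 */
    assert(imm_field_bytes(300, 2) == 2);
    assert(imm_field_bytes(0x12345678, 4) == 4);
}
```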

4.11 8-Bit Sign-Extended Displacements 

Optimization 

Use 8-bit sign-extended displacements for conditional branches. 

Application 

This optimization applies to: 

• 32-bit software 

• 64-bit software 

Rationale 

Using short, 8-bit sign-extended displacements for conditional branches improves code density with 
no negative effects on the processor. 

4.12 Code Padding with Operand-Size Override and NOP 

Optimization 

Use one or more operand-size overrides (66h) and the NOP instruction (90h) to align code and space 
out branches. 

Application 

This optimization applies to: 

• 32-bit software 

• 64-bit software 


Rationale 

Occasionally it is necessary to insert neutral code fillers into the code stream (for example, for 
code-alignment purposes or to space out branches). Because this filler code is executable, it should take up 
as few execution resources as possible, not diminish decode density, and not modify any processor 
state other than advancing the instruction pointer (rIP). Although there are several possible multibyte 
NOP-equivalent instructions that do not change the processor state (other than rIP), combinations of 
the operand-size override and the NOP instruction work best. 


Example 

Assign code-padding sequences like these and use them to align code and space out branches. These 
sequences are suitable for both 32-bit and 64-bit code, and you can use them on the AMD Athlon 64 
and AMD Opteron processors, as well as seventh-generation AMD Athlon processors: 

NOP1_OVERRIDE_NOP TEXTEQU <DB 090h> 

NOP2_OVERRIDE_NOP TEXTEQU <DB 066h,090h> 

NOP3_OVERRIDE_NOP TEXTEQU <DB 066h,066h,090h> 

NOP4_OVERRIDE_NOP TEXTEQU <DB 066h,066h,066h,090h> 

NOP5_OVERRIDE_NOP TEXTEQU <DB 066h,066h,090h,066h,090h> 

NOP6_OVERRIDE_NOP TEXTEQU <DB 066h,066h,090h,066h,066h,090h> 

NOP7_OVERRIDE_NOP TEXTEQU <DB 066h,066h,066h,090h,066h,066h,090h> 

NOP8_OVERRIDE_NOP TEXTEQU <DB 066h,066h,066h,090h,066h,066h,066h,090h> 

NOP9_OVERRIDE_NOP TEXTEQU <DB 066h,066h,090h,066h,066h,090h,066h,066h,090h> 

For x87 floating-point instructions, a better single-byte padding exists. See “Align and Pack 
DirectPath x87 Instructions” on page 242. 
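
The nine sequences follow a regular pattern that can be generated programmatically. The C sketch below is one way to do so, with a hypothetical function name. It splits the requested length into padded NOPs of nearly equal size, longer ones first, with at most three 66h prefixes each. That limit is an assumption inferred from the sequences listed above, not a rule stated in this guide.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fill buf[0..n-1] with operand-size overrides (66h) and NOPs (90h).
   The n bytes are divided into ceil(n/4) padded NOPs of nearly equal
   length, which reproduces the 1- to 9-byte sequences listed above. */
void fill_nop_padding(uint8_t *buf, size_t n)
{
    size_t groups = (n + 3) / 4;    /* number of padded NOPs to emit */
    size_t pos = 0;
    size_t g, i;

    for (g = 0; g < groups; g++) {
        size_t left = groups - g;                    /* NOPs still to emit */
        size_t len = (n - pos + left - 1) / left;    /* ceil(remaining / left) */

        for (i = 0; i + 1 < len; i++)
            buf[pos++] = 0x66;                       /* operand-size override */
        buf[pos++] = 0x90;                           /* NOP */
    }
}

void check_nop_padding(void)
{
    static const uint8_t nop5[] = { 0x66, 0x66, 0x90, 0x66, 0x90 };
    static const uint8_t nop7[] = { 0x66, 0x66, 0x66, 0x90, 0x66, 0x66, 0x90 };
    uint8_t buf[9];

    fill_nop_padding(buf, sizeof nop5);
    assert(memcmp(buf, nop5, sizeof nop5) == 0);
    fill_nop_padding(buf, sizeof nop7);
    assert(memcmp(buf, nop7, sizeof nop7) == 0);
}
```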

Chapter 5 Cache and Memory Optimizations 


The optimizations in this chapter take advantage of the large L1 caches and high-bandwidth buses of 
the AMD Athlon™ 64 and AMD Opteron™ processors. 

This chapter covers the following topics: 


Topic                                                    Page 
Memory-Size Mismatches                                   92 
Natural Alignment of Data Objects                        95 
Cache-Coherent Nonuniform Memory Access (ccNUMA)         96 
Multiprocessor Considerations                            99 
Store-to-Load Forwarding Restrictions                    100 
Prefetch Instructions                                    104 
Write-combining                                          113 
L1 Data Cache Bank Conflicts                             114 
Placing Code and Data in the Same 64-Byte Cache Line     116 
Sorting and Padding C and C++ Structures                 117 
Sorting Local Variables                                  119 
Memory Copy                                              120 
Stack Considerations                                     122 

5.1 Memory-Size Mismatches 

Optimization 

Avoid memory-size mismatches when different instructions operate on the same data. When one 
instruction stores and another instruction subsequently loads the same data, keep their operands 
aligned and keep the loads/stores of each operand the same size. 

Application 

This optimization applies to: 

• 32-bit software 

• 64-bit software 

Examples—Store-to-Load-Forwarding Stalls 

The following code examples result in a store-to-load-forwarding stall: 

64-bit (Avoid) 

foo DQ ?                     ; Assume foo is 8-byte aligned. 

mov DWORD PTR foo, eax       ; Store a DWORD to foo. 
mov DWORD PTR foo+4, ebx     ; Now store to foo+4. 
mov rcx, QWORD PTR foo       ; Load a QWORD from foo. 

32-bit (Avoid) 

foo DQ ?                     ; Assume foo is 4-byte aligned. 

mov DWORD PTR foo, eax       ; Store a DWORD in foo. 
mov DWORD PTR foo+4, edx     ; Store a DWORD in foo+4. 
fld QWORD PTR foo            ; Load a QWORD from foo. 

Avoid 

mov foo, eax 
mov foo+4, edx 
movq mm0, foo 

Preferred 

mov foo, eax 
mov foo+4, edx 
movd mm0, foo 
punpckldq mm0, foo+4 
Preferred If Stores Are Close to the Load 

movd mm0, eax 
mov foo+4, edx 
punpckldq mm0, foo+4 
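
For C programmers, the same idea can be sketched at the source level. When two 32-bit values are going to be consumed as one 64-bit quantity, combining them with a shift and an OR keeps the work in registers instead of writing two doublewords to memory and reading a quadword back. The function name combine_halves is hypothetical, and the instructions actually generated depend on the compiler and its options.

```c
#include <stdint.h>

/* Hypothetical helper: build a 64-bit value from two 32-bit halves
   without a narrow-store/wide-load round trip through memory. */
static uint64_t combine_halves(uint32_t lo, uint32_t hi)
{
    return ((uint64_t)hi << 32) | (uint64_t)lo;
}
```
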

Examples—Large-to-small Mismatches 

Avoid large-to-small mismatches, as shown in the following code: 

64-bit (Avoid) 

foo DQ ? ; Assume foo is 8-byte aligned. 

mov QWORD PTR foo, rax ; Store a QWORD to foo. 

mov eax, DWORD PTR foo ; Load a DWORD from foo. 

mov edx, DWORD PTR foo+4 ; Load a DWORD from foo+4. 

32-bit (Avoid) 

foo DQ ? ; Assume foo is 4-byte aligned. 

fst QWORD PTR foo ; Store a QWORD in foo. 

mov eax, DWORD PTR foo ; Load a DWORD from foo. 

mov edx, DWORD PTR foo+4 ; Load a DWORD from foo+4. 

Avoid 

movq foo, mm0 
mov eax, foo 
mov edx, foo+4 

Preferred 

movd foo, mm0 
pswapd mm0, mm0 
movd foo+4, mm0 
pswapd mm0, mm0 
mov eax, foo 
mov edx, foo+4 

Preferred If the Contents of MM0 Are No Longer Needed 

movd foo, mm0 
punpckhdq mm0, mm0 
movd foo+4, mm0 
mov eax, foo 
mov edx, foo+4 
Preferred If the Stores and Loads Are Close Together, Option 1 

movd eax, mm0 
pswapd mm0, mm0 
movd edx, mm0 
pswapd mm0, mm0 

Preferred If the Stores and Loads Are Close Together, Option 2 

movd eax, mm0 
punpckhdq mm0, mm0 
movd edx, mm0 
5.2 Natural Alignment of Data Objects 


Optimization 

Make sure data objects are naturally aligned. An object is naturally aligned if it is located at an 
address that is a multiple of its size. 


Locate this type of object                    At an address evenly divisible by 

Word                                          2 
Doubleword                                    4 
Quadword                                      8 
Ten-byte (for example, TBYTE or REAL10)       8 (instead of 10) 
Double quadword                               16 


Application 

This optimization applies to: 

• 32-bit software 

• 64-bit software 

Rationale 

A misaligned store or load operation suffers a minimum one-cycle penalty in the processor’s load- 
store pipeline. Also, using misaligned loads and stores increases the likelihood of encountering a 
store-to-load forwarding pitfall, especially when operating in long mode (64-bit software). (For a 
more detailed discussion of store-to-load forwarding issues, see “Store-to-Load Forwarding 
Restrictions” on page 100.) 

In addition, if the Alignment Mask bit is set in Control Register 0 (CR0), an unaligned memory 
reference may cause an alignment check exception. For more information on this topic, see Volume 2 
of the AMD64 Architecture Programmer's Manual (order# 24593). 
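
As an illustration for C code, natural alignment can be requested with the C11 _Alignas specifier and checked at run time by testing the address against the object size. This is a minimal sketch; the helper is_aligned and the two example objects are hypothetical names, not part of any library.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if the address is a multiple of the given (nonzero) alignment. */
static int is_aligned(const void *p, size_t alignment)
{
    return ((uintptr_t)p % alignment) == 0;
}

/* _Alignas (C11) requests at least the stated alignment for an object. */
static _Alignas(16) unsigned char dq_buffer[16];  /* double quadword */
static _Alignas(8)  double        qword_value;    /* quadword */
```

A check such as is_aligned(dq_buffer, 16) can be placed in debug builds to catch objects that have drifted off their natural boundary.
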
5.3 Cache-Coherent Nonuniform Memory Access (ccNUMA) 

Optimization 

For applications with multiple threads, use OS functions to run a thread on a particular node and let 
that thread allocate the memory that it requires so that the memory used is local to that node. In the 
Microsoft Windows environment, the function to run a thread on a particular node is 
SetThreadAffinityMask(). 

Be sure operating systems are properly configured to support ccNUMA. All versions of Microsoft 
Windows XP for AMD64 and Windows Server for AMD64 support ccNUMA without any changes. 
The 32-bit versions of Windows Server 2003, Enterprise Edition and Windows Server 2003, 
Datacenter Edition require the /PAE boot parameter to support ccNUMA. 

For 64-bit Linux, there may be separate kernels supporting ccNUMA that should be selected. 

Application 

This optimization applies to: 

• 32-bit software 

• 64-bit software 

Rationale 

Most multiprocessor systems available today employ a symmetric multiprocessing (SMP) 
architecture. Processors on an SMP platform generally share a common or centralized memory bus 
and have identical memory access latencies regardless of the processor position. Because the processors 
use the same bus and memory, system performance may be negatively affected when bottlenecks 
occur due to increased demands on the single memory bus. Figure 1 shows a simplified block diagram 
for a two-processor SMP system. 
Figure 1. Simple SMP Block Diagram 

The AMD Opteron processor implements a cache-coherent nonuniform memory access (ccNUMA) 
architecture when two or more processors are connected together on the same motherboard. In a 
ccNUMA design, each processor has its own memory system. When a processor accesses memory on 
its own local memory system, the latency is relatively low, especially when compared to a similar 
SMP system. If a processor accesses memory located on a different processor, then the latency will be 
higher. The phrase ‘non-uniform memory access’ refers to this potential difference in latency. 

In an AMD Opteron processor system, each processor contains its own memory controller. Figure 2 
shows an example of a two-processor AMD Opteron system in a ccNUMA configuration. 



Figure 2. AMD Opteron 

Dual-Core AMD Opteron processors and AMD Athlon X2 Dual-Core processors share the on-chip 
integrated memory controller and memory. Two or more AMD dual-core processors still use the 
ccNUMA configuration. Figure 3 illustrates a dual-core AMD Opteron configuration. 
[Figure: AMD Opteron™ Dual-Core Processor 0 and AMD Opteron™ Dual-Core Processor 1] 

Figure 3. Dual-Core AMD Opteron™ Processor Configuration 

OS Implications 

An operating system running on an AMD Opteron platform will coordinate and manage the memory 
configuration so that an application does not have to be aware of this memory configuration. Thanks 
to the OS, the platform will simply appear to have one contiguous block of memory regardless of how 
many processors are in the platform. 

Because of the difference in latencies in ccNUMA systems, the OS must make determinations that 
enable the best performance. It would be undesirable, for example, to spawn a thread on a processor 
while allocating the memory space for that thread on a different processor. For such reasons, it is 
important to be aware of the capabilities of the OS being used. Microsoft's Windows Server 2003 
products are ccNUMA aware. The SUSE distribution of 64-bit Linux also has a ccNUMA aware 
kernel for AMD64 processors. 

Windows applications that spawn several threads, where each thread operates on largely independent 
data, might benefit from distributing those threads across several processors and allocating memory 
locally for each thread. This can be accomplished by using the SetThreadAffinityMask() function 
and by allocating memory blocks using VirtualAlloc() from within the thread that will be heavily 
accessing that memory block. Memory is not actually committed until it is accessed, and then it is 
committed to the node that accesses it. For this reason, if there is a chance that another node could 
access the block first, it is a good idea to initialize the block using memset() or other code that 
touches every page in the block. See the Microsoft documentation on MSDN for more 
details (search for SetThreadAffinityMask()). 
5.4 Multiprocessor Considerations 

In a multiprocessor system, data within a single cache line that is shared between processors can 
reduce performance. In certain cases (for example, semaphores), this kind of cache-line data sharing 
cannot be avoided, but it should be minimized where possible. 

Data can often be restructured so this does not occur. Cache lines on AMD Athlon 64 and 
AMD Opteron processors are presently 64 bytes, but a scheme that avoids this problem regardless of 
cache-line size makes for more performance-portable code. 

For example, per-thread data can be allocated on the heap (for example, via calls to malloc()), and 
this is preferred over statically defined shared arrays and variables that are potentially located in a 
single cache line. Furthermore, some software environments even provide special versions of malloc 
that guarantee data alignment to a specified value, and these can be useful in aligning data and 
eliminating unwanted cache line overlap. 
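
One possible way to apply this in C is sketched below, assuming a C11 library that provides aligned_alloc() and a 64-byte cache line. The structure per_thread, the constant CACHE_LINE, and the function alloc_per_thread are hypothetical names chosen for this illustration. Each record is padded to a full line, so records belonging to different threads never share a cache line.

```c
#include <stdlib.h>
#include <stdint.h>

#define CACHE_LINE 64  /* assumed line size; see the text above */

/* Hypothetical per-thread record, padded to fill one whole cache line. */
struct per_thread {
    long counter;
    char pad[CACHE_LINE - sizeof(long)];
};

/* Allocate one cache-line-aligned record per thread.
   aligned_alloc (C11) expects the size to be a multiple of the alignment,
   which holds here because each record is exactly CACHE_LINE bytes. */
static struct per_thread *alloc_per_thread(size_t nthreads)
{
    return aligned_alloc(CACHE_LINE, nthreads * sizeof(struct per_thread));
}
```

Code that must remain portable across cache-line sizes could obtain the line size at run time instead of hard-coding 64.
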

Application 

This optimization applies to: 

• 32-bit software 

• 64-bit software 
5.5 Store-to-Load Forwarding Restrictions 

Store-to-load forwarding refers to the process of a load reading (forwarding) data from the store 
buffer. When this can occur, it improves performance because the load does not have to wait for the 
recently written (stored) data to be written to cache and then read back out again. There are instances 
in the load-store architecture of the AMD Athlon 64 and AMD Opteron processors when a load 
operation is not allowed to read needed data from a store in the store buffer. 

In these cases, the load cannot complete (load the needed data into a register) until the store has 
retired out of the store buffer and written to the data cache. A store-buffer entry cannot retire and 
write to the data cache until every instruction before the store has completed and retired from the 
reorder buffer. 

The implication of this restriction is that all instructions in the reorder buffer, up to and including the 
store, must complete and retire out of the reorder buffer before the load can complete. Effectively, the 
load has a false dependency on every instruction up to the store. 

Due to the significant depth of the LS buffer of the AMD Athlon 64 and AMD Opteron processors, 
any load that is dependent on a store that cannot bypass data through the LS buffer may experience 
significant delays of up to tens of clock cycles, where the exact delay is a function of pipeline 
conditions. 

The following sections describe store-to-load forwarding examples. 

Store-to-Load Forwarding Pitfalls—True Dependencies 

A load is allowed to read data from the store-buffer entry only if all of the following conditions are 
satisfied: 

• The start address of the load matches the start address of the store. 

• The load operand size is equal to or smaller than the store operand size. 

• Neither the load nor the store is misaligned. 

• The store data is not from a high-byte register (AH, BH, CH, or DH). 
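
The four conditions above can be read as a single predicate. The following C sketch models only those conditions as stated; it is not a model of the processor, and the type mem_op and the functions is_misaligned and can_forward are hypothetical names introduced for illustration. An operand is treated as misaligned when its address is not a multiple of its size.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one memory operand. */
struct mem_op {
    uintptr_t addr;   /* start address */
    size_t    size;   /* operand size in bytes: 1, 2, 4, or 8 */
};

static bool is_misaligned(struct mem_op op)
{
    return (op.addr % op.size) != 0;
}

/* Returns true only if all four conditions listed above hold. */
static bool can_forward(struct mem_op store, struct mem_op load,
                        bool store_from_high_byte_reg)
{
    return store.addr == load.addr
        && load.size <= store.size
        && !is_misaligned(store)
        && !is_misaligned(load)
        && !store_from_high_byte_reg;
}
```

For instance, a word store at 10h followed by a doubleword load at 10h fails the size condition, while an aligned quadword store followed by a doubleword load from the same address passes all four.
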

The following sections describe common-case scenarios to avoid. In these scenarios, a load has a true 
dependency on an LS2-buffered store, but cannot read (forward) data from a store-buffer entry. 

Narrow-to-Wide Store-Buffer Data-Forwarding Restriction 

If the following conditions are present, there is a narrow-to-wide store-buffer data-forwarding 
restriction: 

• The operand size of the store data is smaller than the operand size of the load data. 

• The range of addresses spanned by the store data covers some subrange of the addresses spanned 
by the load data. 
Avoid 

mov eax, 10h 
mov WORD PTR [eax], bx       ; Word store 
mov ecx, DWORD PTR [eax]     ; Doubleword load--cannot forward upper byte 
                             ; from store buffer 

Avoid 

mov eax, 10h 
mov BYTE PTR [eax+3], bl     ; Byte store 
mov ecx, DWORD PTR [eax]     ; Doubleword load--cannot forward upper byte 
                             ; from store buffer 

Wide-to-Narrow Store-Buffer Data-Forwarding Restriction 

If the following conditions are present, there is a wide-to-narrow store-buffer data-forwarding 
restriction: 

• The operand size of the store data is greater than the operand size of the load data. 

• The start address of the store data does not match the start address of the load data. 

Avoid 

mov eax, 10h 
add DWORD PTR [eax], ebx     ; Doubleword store 
mov cx, WORD PTR [eax+2]     ; Word load--cannot forward high word 
                             ; from store buffer 

Avoid 

movq [foo], mm1              ; Store upper and lower half. 
add eax, [foo]               ; Fine 
add edx, [foo+4]             ; Not good! 

Preferred 

movd [foo], mm1              ; Store lower half. 
punpckhdq mm1, mm1           ; Copy upper half into lower half. 
movd [foo+4], mm1            ; Store lower half. 
add eax, [foo]               ; Fine 
add edx, [foo+4]             ; Fine 

Misaligned Store-Buffer Data-Forwarding Restriction 

If the following condition is present, there is a misaligned store-buffer data-forwarding restriction: 

• The store or load address is misaligned. For example, a quadword store is not aligned to a 
quadword boundary. 
A common case of misaligned store-data forwarding involves the passing of misaligned quadword 
floating-point data on the doubleword-aligned integer stack. Avoid the type of code shown in the 
following example: 

mov esp, 24h 
fstp QWORD PTR [esp]         ; ESP = 24h 
...                          ; Store occurs to quadword-misaligned address. 
fld QWORD PTR [esp]          ; Quadword load cannot forward from quadword 
                             ; misaligned 'FSTP [ESP]' store operation. 

High-Byte Store-Buffer Data-Forwarding Restriction 

If the following condition is present, there is a high-byte store-data buffer-forwarding restriction—the 
store data is from a high-byte register (AH, BH, CH, DH). 

Avoid the type of code shown in the following example: 

mov eax, 10h 
mov [eax], bh                ; High-byte store 
mov dl, [eax]                ; Load cannot forward from high-byte store. 

One Supported Store-to-Load Forwarding Case 

There is one case of a mismatched store-to-load forwarding that is supported by AMD Athlon 64 and 
AMD Opteron processors. The lower 32 bits from an aligned quadword write feeding into a 
doubleword read is allowed, as illustrated in the following example: 

movq [alignedQword], mm0 

mov eax, [alignedQword] 

Store-to-Load Forwarding—False Dependencies 

A load may detect a false dependency on a store-buffer entry if the load does not have a true 
dependency on the most recent store that matches address bits 11-2 of the load. A false match could 
occur on the most recent store that writes somewhere within the same doubleword of memory as the 
load. In addition, a false match could occur if a store address is located at an exact multiple of 
4-Kbyte pages away from the load address (address bits 47-12 do not match). Avoid the type of code 
shown in the following example: 

mov eax, 10h 
mov [eax], bx                ; Word store to address 10h 
mov cx, [eax+2]              ; Word load to address 12h 
                             ; Load detects a false dependency 
                             ; on store because it is in the 
                             ; same doubleword of memory. 
mov cx, [eax+4]              ; Word load to address 14h 
                             ; Load does not detect a false 
                             ; dependency because it is to a 
                             ; different doubleword of memory. 


Here is another example of the type of code to avoid: 

mov eax, 10h 
mov [eax], bl                ; First store to DWORD at address 10h 
mov [eax+1], cl              ; Second store to DWORD at address 10h 
mov dl, [eax]                ; Load detects a false 
                             ; dependency on the second store 
                             ; because it is the most recent 
                             ; store to the same doubleword of 
                             ; memory as the load. 
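
The address comparison behind these false matches can be sketched in C as a mask over address bits 11-2. This models only the bit comparison described in this section, not whether the load has a true dependency on the store; the function name bits_11_2_match is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Compare address bits 11-2, that is, the doubleword index within a
   4-Kbyte page. Bits 1-0 and all bits above bit 11 are ignored. */
static bool bits_11_2_match(uintptr_t store_addr, uintptr_t load_addr)
{
    const uintptr_t mask = 0xFFC;  /* bits 11..2 set */
    return (store_addr & mask) == (load_addr & mask);
}
```

Addresses 10h and 12h match because they lie in the same doubleword, 10h and 14h do not, and 10h and 1010h match because they are exactly 4 Kbytes apart.
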


Summary of Store-to-Load-Forwarding Pitfalls to Avoid 

To avoid store-to-load-forwarding pitfalls, follow these guidelines: 

• Maintain consistent use of operand size across all loads and stores. Preferably use doubleword or 
quadword operand sizes. 

• Avoid misaligned data references. 

• Avoid narrow-to-wide and wide-to-narrow forwarding cases. 

• When using word or byte stores, avoid loading data from anywhere in the same doubleword of 
memory other than the identical start addresses of the stores. 


Application 

This optimization applies to: 

• 32-bit software 

• 64-bit software 
5.6 Prefetch Instructions 

Optimization 

Where appropriate, use one of the prefetch instructions to increase the effective bandwidth of the 
AMD Athlon 64 and AMD Opteron processors. 

Application 

This optimization applies to: 

• 32-bit software 

• 64-bit software 

Rationale 

Prefetch instructions take advantage of the high bus bandwidth of the AMD Athlon 64 and 
AMD Opteron processors to hide latencies when fetching data from system memory. A prefetch 
instruction initiates a read request of a specified address and reads the entire cache line that contains 
that address. 

AMD Athlon 64 and AMD Opteron processors perform three types of prefetches: 


Prefetch type and description: 

Load: Reads the data into the L1 data cache; the data is later evicted to the L2 cache. The 
following instructions perform load prefetches: PREFETCH, PREFETCHT0, 
PREFETCHT1, and PREFETCHT2. 

Store: Reads the data into the L1 data cache and marks the data as modified; the data is 
later evicted to the L2 cache. The PREFETCHW instruction performs a store prefetch. 

Nontemporal: The PREFETCHNTA instruction performs a nontemporal prefetch. The data is read 
into the L1 data cache; to avoid cache pollution, when a PREFETCHNTA misses in 
the L2 cache and reads from memory, the data is never evicted to the L2 cache. When 
a PREFETCHNTA hits in the L2 cache, the data is evicted back to the L2 cache. AMD 
Athlon 64 and AMD Opteron processors prior to Revision E read data into one way of 
the L1 cache when the PREFETCHNTA instruction was used. Revision E processors 
read PREFETCHNTA data into both ways of the L1 cache. 


The prefetch instructions can be used anywhere, in any type of code. The use of prefetch instructions 
is not affected by the values of Control Register 0 (CR0) bits, such as CR0.EM and CR0.TS. 

Prefetching versus Preloading 

In code that makes irregular memory accesses rather than sequential accesses, an ordinary MOV 
instruction is the best way to load data. But in situations where sequential addresses are read, prefetch 
instructions can improve performance. Prefetch instructions only update the L1 data cache and do not 
update an architectural register. This uses one less register compared to a load instruction. 

Unit-Stride Access 

Large data sets typically require unit-stride access to ensure that all data pulled in by a prefetch 
instruction is actually used. Large data sets make use of all data that is read from memory, rather than 
using only a sparse subset of the memory. If necessary, you should reorganize algorithms or data 
structures to allow unit-stride access. For a definition of unit-stride access, see “Definitions” on 
page 110. 

Hardware Prefetching 

The AMD Athlon 64 and AMD Opteron processors implement a hardware prefetching mechanism. 
The prefetched data is loaded into the L2 cache. The hardware prefetcher works most efficiently when 
data is accessed on a cache-line-by-cache-line basis (that is, without skipping cache lines). Cache 
lines on current AMD Athlon 64 and AMD Opteron processors are 64 bytes, but cache-line size is 
implementation dependent. 

The hardware prefetcher prefetches data that is accessed in an ascending or descending order on a 
cache-line-by-cache-line basis. For example, when the hardware prefetcher detects an access to cache 
line l followed by an access to cache line l + 1, it initiates a prefetch of cache line l + 3. Accessing 
data in increments larger than 64 bytes may fail to trigger the hardware prefetcher because cache lines 
are skipped. In these cases, software-prefetch instructions should be employed. Note that in some 
earlier revisions of the AMD Athlon 64 and AMD Opteron processors the hardware prefetcher would 
only detect ascending accesses. 

In some cases, using prefetch instructions on processors with hardware prefetching may slightly 
reduce performance. In these cases, it may be necessary to remove the prefetch instructions. All 
current AMD Athlon 64 and AMD Opteron processors have hardware prefetching mechanisms. 

PREFETCH/W versus PREFETCHNTA/T0/T1/T2 

PREFETCHNTA, PREFETCHT0, PREFETCHT1, and PREFETCHT2 are SSE instructions and are 
processor-implementation dependent. For the AMD Athlon 64 and AMD Opteron processors, data 
that is prefetched with the PREFETCHNTA instruction is not placed into the L2 cache when it is 
evicted unless it was originally in L2 when prefetched. 

PREFETCHNTA is intended for non-temporal data that will not be needed again soon. 
PREFETCHNTA should also be used when reading arrays that are so large that they are larger than 
the L2 cache. Because of their size, such large arrays will not be available in L2 even if they are 
needed again, and by feeding them through the L2 cache, other possibly useful data will also be 
evicted from L2. 

Note: The L2 cache size of the processor can be determined by using the CPUID instruction. 
Chapters 5 and 9 show examples of how to use the PREFETCHNTA instruction. 
Note: PREFETCHNTA should NOT be used for large arrays that are only being written, not read. 
In such cases, write-combining stores should be used. (See “Write-combining” on page 113, 
Appendix B “Implementation of Write-Combining” on page 263, and “Write-Combining” in 
Volume 2 of the AMD64 Architecture Programmer's Manual (order no. 24593).) 

Current AMD Athlon 64 and AMD Opteron processors implement the PREFETCHT0, 
PREFETCHT1, and PREFETCHT2 instructions in exactly the same way as the PREFETCH 
instruction. That is, the data is brought into the L1 data cache. This functionality could be changed in 
future implementations. 

PREFETCHW versus PREFETCH 

Code that intends to modify the cache line that is brought in through prefetching should use the 
PREFETCHW instruction. PREFETCHW gives a hint to the AMD Athlon 64 and AMD Opteron 
processors of an intent to modify the cache line. The AMD Athlon 64 and AMD Opteron processors 
mark the cache line being read by PREFETCHW as modified. Using PREFETCHW can save 
additional cycles compared to PREFETCH, and avoid the subsequent cache state change caused by a 
write to the prefetched cache line. Only use PREFETCHW if there is a write to the same cache line 
afterwards. 

Write-Combining Usage 

Use write-combining instructions instead of PREFETCHW in situations where all of the following 
conditions are true: 

• The code will overwrite one or more complete cache lines with new data. 

• The new data will not be used again soon. 

Write-combining instructions include the SSE and SSE2 instructions MOVNTDQ, MOVNTI, 
MOVNTPS, and MOVNTPD. They also include the MMX instruction MOVNTQ. 

Write-combining instructions can dramatically improve memory-write performance. They write data 
directly to memory through write-combining buffers, bypassing the cache. This is faster than 
PREFETCHW because data does not need to be initially read from memory to fill the cache lines, 
only to be completely overwritten shortly thereafter. The new data is simply written to memory, 
replacing the old data in memory, so no memory read is performed. 

One application where write-combining is useful, often in conjunction with prefetch instructions, is in 
copying large blocks of memory. 

Note: The write-combining instructions are not recommended or necessary for write-combined 
memory regions since the processor will automatically combine writes for those regions. 
Write-combine memory types are indicated through the MTRRs and the page-attribute table 
(PAT). 

Note: For best performance, do not mix write-combining instructions on a cache line with 
non-write-combining store instructions. 

For more information on write-combining, see Appendix B, “Implementation of Write-Combining.” 

Multiple Prefetches 

Programmers can initiate multiple outstanding prefetches on the AMD Athlon 64 and AMD Opteron 
processors. The AMD Athlon 64 and AMD Opteron processors can have a theoretical maximum of 
eight outstanding prefetches, but in practice the number is usually smaller. When all resources are 
filled by various memory read requests, the processor waits until resources become free before 
processing the next request. Multiple prefetch requests are essentially handled in order, prefetching 
data in the order that it is needed. 

The following example shows how to initiate multiple prefetches when traversing more than one 
array. 

Example—Multiple Prefetches 

.CODE 
.K3D 
.686 

; Original C code: 
; 
; #define LARGE_NUM 65536 
; #define ARR_SIZE (LARGE_NUM*8) 
; 
; double array_a[LARGE_NUM]; 
; double array_b[LARGE_NUM]; 
; double array_c[LARGE_NUM]; 
; int i; 
; 
; for (i = 0; i < LARGE_NUM; i++) { 
;     a[i] = b[i] * c[i]; 
; } 

mov edx, (-LARGE_NUM)                      ; Use biased index. 
mov eax, OFFSET array_a                    ; Get address of array_a. 
mov ebx, OFFSET array_b                    ; Get address of array_b. 
mov ecx, OFFSET array_c                    ; Get address of array_c. 

loop: 
prefetchw [eax+256]                        ; Four cache lines ahead 
prefetch [ebx+256]                         ; Four cache lines ahead 
prefetch [ecx+256]                         ; Four cache lines ahead 

fld  QWORD PTR [ebx+edx*8+ARR_SIZE]        ; b[i] 
fmul QWORD PTR [ecx+edx*8+ARR_SIZE]        ; b[i] * c[i] 
fstp QWORD PTR [eax+edx*8+ARR_SIZE]        ; a[i] = b[i] * c[i] 
fld  QWORD PTR [ebx+edx*8+ARR_SIZE+8]      ; b[i+1] 
fmul QWORD PTR [ecx+edx*8+ARR_SIZE+8]      ; b[i+1] * c[i+1] 
fstp QWORD PTR [eax+edx*8+ARR_SIZE+8]      ; a[i+1] = b[i+1] * c[i+1] 
fld  QWORD PTR [ebx+edx*8+ARR_SIZE+16]     ; b[i+2] 
fmul QWORD PTR [ecx+edx*8+ARR_SIZE+16]     ; b[i+2] * c[i+2] 
fstp QWORD PTR [eax+edx*8+ARR_SIZE+16]     ; a[i+2] = b[i+2] * c[i+2] 
fld  QWORD PTR [ebx+edx*8+ARR_SIZE+24]     ; b[i+3] 
fmul QWORD PTR [ecx+edx*8+ARR_SIZE+24]     ; b[i+3] * c[i+3] 
fstp QWORD PTR [eax+edx*8+ARR_SIZE+24]     ; a[i+3] = b[i+3] * c[i+3] 
fld  QWORD PTR [ebx+edx*8+ARR_SIZE+32]     ; b[i+4] 
fmul QWORD PTR [ecx+edx*8+ARR_SIZE+32]     ; b[i+4] * c[i+4] 
fstp QWORD PTR [eax+edx*8+ARR_SIZE+32]     ; a[i+4] = b[i+4] * c[i+4] 
fld  QWORD PTR [ebx+edx*8+ARR_SIZE+40]     ; b[i+5] 
fmul QWORD PTR [ecx+edx*8+ARR_SIZE+40]     ; b[i+5] * c[i+5] 
fstp QWORD PTR [eax+edx*8+ARR_SIZE+40]     ; a[i+5] = b[i+5] * c[i+5] 
fld  QWORD PTR [ebx+edx*8+ARR_SIZE+48]     ; b[i+6] 
fmul QWORD PTR [ecx+edx*8+ARR_SIZE+48]     ; b[i+6] * c[i+6] 
fstp QWORD PTR [eax+edx*8+ARR_SIZE+48]     ; a[i+6] = b[i+6] * c[i+6] 
fld  QWORD PTR [ebx+edx*8+ARR_SIZE+56]     ; b[i+7] 
fmul QWORD PTR [ecx+edx*8+ARR_SIZE+56]     ; b[i+7] * c[i+7] 
fstp QWORD PTR [eax+edx*8+ARR_SIZE+56]     ; a[i+7] = b[i+7] * c[i+7] 

add edx, 8                                 ; Compute next 8 products 
jnz loop                                   ; until none left. 

END 

The following optimization rules are applied to this example: 

• Partially unroll loops to ensure that the data stride per loop iteration is equal to the length of a 
cache line. This avoids overlapping PREFETCH instructions and thus makes optimal use of the 
available number of outstanding prefetches. 

• Because the array array_a is written rather than read, use PREFETCHW instead of PREFETCH 
to avoid overhead for switching cache lines to the correct state. The prefetch distance is optimized 
such that each loop iteration is working on three cache lines while active prefetches bring in the 
next cache lines. 

• Reduce index arithmetic to a minimum by use of complex addressing modes and biasing of the 
array base addresses in order to cut down on loop overhead. 

Determining Prefetch Distance 

When determining how far ahead to prefetch, the basic guideline is to initiate the prefetch early 
enough so that the data is in the cache by the time it is needed, under the constraint that there cannot 
be more than eight prefetches in flight at any given time. 

To determine the optimal prefetch distance, use empirical benchmarking when possible. Prefetching 
three or four cache lines ahead (192 or 256 bytes) is a good starting point and usually gives good 
results. Trying to prefetch too far ahead impairs performance. 
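
When benchmarking from C, it can help to make the prefetch distance a parameter so that different values are easy to try. The sketch below assumes a GCC- or Clang-compatible compiler, whose __builtin_prefetch extension takes an address and an optional read/write hint; which prefetch instruction is emitted (if any) depends on the compiler and target options. The function name multiply_arrays and the macros are hypothetical.

```c
#include <stddef.h>

#define CACHE_LINE        64   /* bytes; assumed, see the text above */
#define DOUBLES_PER_LINE  (CACHE_LINE / sizeof(double))

/* a[i] = b[i] * c[i], prefetching `lines_ahead` cache lines ahead.
   __builtin_prefetch is a GCC/Clang extension; its second argument is
   0 for a read prefetch and 1 for a write prefetch. */
static void multiply_arrays(double *a, const double *b, const double *c,
                            size_t n, size_t lines_ahead)
{
    const size_t dist = lines_ahead * DOUBLES_PER_LINE;

    for (size_t i = 0; i < n; i++) {
        /* Issue one set of prefetches per 64 bytes of progress,
           and only while the target element is still in bounds. */
        if (i % DOUBLES_PER_LINE == 0 && i + dist < n) {
            __builtin_prefetch(&a[i + dist], 1);
            __builtin_prefetch(&b[i + dist], 0);
            __builtin_prefetch(&c[i + dist], 0);
        }
        a[i] = b[i] * c[i];
    }
}
```

Calling multiply_arrays with lines_ahead set to 3 or 4 corresponds to the 192-byte and 256-byte starting points suggested above; timing several values on the target system is the only reliable way to choose.
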

Memory-Limited versus Processor-Limited Code 

Software prefetching can help to hide the memory latency, but it cannot increase the total memory 
bandwidth. Many loops are limited by memory bandwidth rather than processor speed, as shown in 
Figure 4. In these cases, the best that software prefetching can do is to ensure that enough memory 
requests are “in flight” to keep the memory system busy all of the time. The AMD Athlon 64 and 
AMD Opteron processors support a maximum of eight concurrent memory requests to different cache 
lines. Multiple requests to the same cache line count as only one towards this limit of eight. 

Figure 4. Memory-Limited Code 


Code that performs many computations on each cache line is limited by processor speed rather than 
memory bandwidth, as shown in Figure 5. In this case, the goal of software prefetching is just to 
ensure that the memory data is available when the processor needs it. As the processor speed 
increases, the optimal prefetch distance increases until the memory bandwidth becomes the limiting 
factor. 

For an example of how to use software prefetching in processor-limited code, see “Structuring Code 
with Prefetch Instructions to Hide Memory Latency” on page 200. 
Figure 5. Processor-Limited Code 

Definitions 

Unit-stride access refers to a memory access pattern where consecutive memory accesses are made to 
consecutive array elements, in ascending or descending order. If the arrays are made of elemental 
types, then they imply adjacent memory locations as well. For example: 

char j, k[MAX]; 

for (i = 0; i < MAX; i++) { 
    j += k[i];    // Every byte is used. 
} 

double x, y[MAX]; 

for (i = 0; i < MAX; i++) { 
    x += y[i];    // Every byte is used. 
} 

Exception to Unit Stride 

The unit-stride concept works well when stepping through arrays of elementary data types. In some 
instances, unit stride alone may not be sufficient to determine how to use the PREFETCH instruction 
properly. For example, assume that there is a vertex structure of 256 bytes and the code steps through 
the vertices in unit stride, but using only the x, y, z, w components, each being of type float (for 
example, the first 16 bytes of each vertex). In this case, the prefetch distance obviously should be 
some function of the data size structure (for a properly chosen n): 

prefetch [eax+n*structure_size] 

add eax, structure_size 

You should experiment to find the optimal prefetch distance; there is no formula that works for all 
situations. 
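
A C version of the vertex case might look like the following sketch. It assumes a GCC- or Clang-compatible compiler for the __builtin_prefetch extension and a 4-byte float; the structure layout, the function name sum_x, and the parameter n are hypothetical. Because only the first 16 bytes of each 256-byte vertex are read, one prefetch per vertex is issued, n structures ahead.

```c
#include <stddef.h>

/* Hypothetical 256-byte vertex; only the first 16 bytes are read below. */
struct vertex {
    float x, y, z, w;
    unsigned char other[256 - 4 * sizeof(float)];
};

/* Sum the x components, prefetching the vertex n structures ahead. */
static float sum_x(const struct vertex *v, size_t count, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < count; i++) {
        if (i + n < count)
            __builtin_prefetch(&v[i + n], 0);  /* GCC/Clang extension */
        sum += v[i].x;
    }
    return sum;
}
```
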

Data Stride per Loop Iteration 

Assuming unit-stride access to a single array, the data stride of a loop (the loop stride) refers to the 
number of bytes accessed in the array per loop iteration. For example: 

fldz 

add_loop: 

fadd QWORD PTR [ebx*8+base_address] 

dec ebx 

jnz add_loop 

The data stride of the above loop is eight bytes. In general, for optimal use of prefetching, the data 
stride per iteration is the length of a cache line (64 bytes in the AMD Athlon 64 and AMD Opteron 
processors). If the loop stride is smaller, unroll the loop enough to use a whole cache line of data per 
iteration. However, unrolling the loop may not be feasible if the original loop stride is very small (for 
example, only two bytes). 
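
As a sketch of this unrolling in C, the loop below sums an array of doubles and consumes eight 8-byte elements, one 64-byte cache line's worth of data, per iteration of the main loop. The function name sum_unrolled is hypothetical, and a compiler may already perform a similar transformation on its own.

```c
#include <stddef.h>

/* Sum n doubles, consuming one 64-byte cache line (8 doubles) per
   iteration of the main loop. The tail loop handles leftovers. */
static double sum_unrolled(const double *p, size_t n)
{
    double s0 = 0.0, s1 = 0.0;
    size_t i = 0;

    for (; i + 8 <= n; i += 8) {      /* data stride: 8 * 8 = 64 bytes */
        s0 += p[i]     + p[i + 1] + p[i + 2] + p[i + 3];
        s1 += p[i + 4] + p[i + 5] + p[i + 6] + p[i + 7];
    }
    for (; i < n; i++)                /* remaining elements */
        s0 += p[i];
    return s0 + s1;
}
```
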

Prefetch at Least 64 Bytes Away from Surrounding Stores 

The prefetch instructions can be affected by false dependencies on stores. If there is a store to an 
address that matches a request, that request (the prefetch instruction) may be blocked until the store is 
written to the cache. Therefore, code should prefetch data that is located at least 64 bytes away from 
any surrounding store’s data address. 
5.7 Streaming-Store/Non-Temporal Instructions 

Optimization 

Use streaming store instructions such as MOVNTPS and MOVNTQ when writing arrays or buffers
that do not need to reside in cache. These instructions allow the processor to perform a write
without first reading the data from memory or other processors' caches. This saves the time needed to
read the cache line, and also avoids evicting data that may still be needed from the cache. This can be
a significant performance advantage. These instructions are available in most compilers using inline 
assembly or intrinsics. Routines 5 and 6 in Section 5.13, “Appropriate Memory Copying Routines” 
illustrate using the combination of streaming store instructions with the PREFETCHNTA instruction 
to optimize memory copy routines. 

Application 

This optimization applies to: 

• 32-bit software 

• 64-bit software 

Rationale 

Streaming store instructions are also sometimes called write-combining instructions. In order to 
improve system performance, the AMD Athlon 64 and AMD Opteron processors aggressively 
combine multiple memory-write cycles of any data size that address locations within a 64-byte cache- 
line-aligned write buffer if a streaming-store instruction is used. This combining is accomplished with 
write-combine buffers. The number of write-combine buffers is processor-implementation dependent. 
Be sure to refer to Appendix B for much more detailed information on write-combining. 

Be sure to follow the last streaming-store instruction in a block of code with the MFENCE instruction 
to assure that all of the write-combine buffers are written to memory. 

Streaming Store instructions are also discussed in “Write-Combining Usage” on page 106. Also see 
Appendix B, "Implementation of Write-Combining." For more information on write-combining, see 
"Write-Combining" in the AMD64 Architecture Programmer's Manual Volume 2: System 
Programming (order# 24593). 
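A minimal sketch of a streaming-store loop using compiler intrinsics follows. It assumes an x86 compiler that provides <emmintrin.h>; the function name, the 16-byte alignment requirement, and the multiple-of-four length are assumptions of this example, not requirements stated in this guide.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>   /* SSE/SSE2 intrinsics */

/* Fills dst[0..n-1] with value using MOVNTPS (via _mm_stream_ps).
   Assumes dst is 16-byte aligned and n is a multiple of 4. */
void fill_streaming(float *dst, size_t n, float value)
{
    assert(((uintptr_t)dst & 15) == 0 && n % 4 == 0);

    __m128 v = _mm_set1_ps(value);
    for (size_t i = 0; i < n; i += 4)
        _mm_stream_ps(dst + i, v);   /* write-combined, bypasses the cache */

    _mm_mfence();   /* MFENCE after the last streaming store, as advised above */
}
```

Writing whole 64-byte lines in ascending order gives the write-combine buffers the best chance of being flushed as full lines.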
5.8 Write-Combining

Optimization 

Operating-system, device-driver, and BIOS programmers should take advantage of the
write-combining capabilities of the AMD Athlon 64 and AMD Opteron processors.

For details, see Appendix B, “Implementation of Write-Combining.” For more information on
write-combining, see “Write-Combining” in Volume 2 of the AMD64 Architecture Programmer's Manual
(order no. 24593).

Application 

This optimization applies to: 

• 32-bit software 

• 64-bit software 

Rationale 

In order to improve system performance, the AMD Athlon 64 and AMD Opteron processors 
aggressively combine multiple memory-write cycles (of any data size) that address locations within a 
64-byte cache-line-aligned write buffer. 
5.9 L1 Data Cache Bank Conflicts

Optimization

Pair loads that do not have a bank conflict in the L1 data cache to improve load throughput.

Application 

This optimization applies to: 

• 32-bit software 

• 64-bit software 

Fields Used to Address the Multibank L1 Data Cache

The L1 data cache is a multibank design consisting of 8 banks total, where each bank is 8 bytes wide.
To address the L1 data cache, the processor uses fields within the address as shown in the following
diagram:

Bits 14-6    Bits 5-3    Bits 2-0
Index        Bank        Byte

How to Know If a Bank Conflict Exists 

The existence of a bank conflict between two neighboring loads depends on their bank and index 
values: 


When the bank is    And the index is                Then a bank conflict
Different           Either the same or different    Does not exist
The same            The same                        Does not exist
The same            Different                       Exists


In other words, with common data types, consecutive array elements cannot have a bank conflict. If 
the array elements are 4 bytes or less, the two loads are to the same index and the same bank, and no 
conflict occurs. If the array elements are 8 bytes, the loads are to the same index but different banks, 
so a bank conflict does not occur either. 
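The rule can be expressed as a small C predicate. This is a sketch under the assumption that the address fields are laid out as index in bits 14-6, bank in bits 5-3, and byte offset in bits 2-0; the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bank field: address bits 5-3 (8 banks, each 8 bytes wide). */
static unsigned l1_bank(uintptr_t addr)
{
    return (unsigned)((addr >> 3) & 0x7);
}

/* Index field: address bits 14-6. */
static unsigned l1_index(uintptr_t addr)
{
    return (unsigned)((addr >> 6) & 0x1FF);
}

/* Two neighboring loads conflict only if they hit the same bank
   at different indexes. */
bool l1_bank_conflict(uintptr_t a, uintptr_t b)
{
    return l1_bank(a) == l1_bank(b) && l1_index(a) != l1_index(b);
}
```

For example, two loads 64 bytes apart land in the same bank at different indexes and therefore conflict, while loads 8 bytes apart use different banks and can be paired.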
Rationale

Loads are served by the L1 data cache in program order, but the number of loads that the processor
can perform in one cycle depends on whether a bank conflict exists between the loads: 


When a bank conflict    Then the number of loads the processor can perform per cycle is
Exists                  1
Does not exist          2


Therefore, pairing loads that do not have a bank conflict helps maximize load throughput. 

Example 

Avoid code like this, where two loads without a bank conflict are separated by other instructions: 

   fld   qword ptr [eax]
   fmul  qword ptr [ebx]
   faddp st(3), st
   fld   qword ptr [eax+8]
   fmul  qword ptr [ebx+8]
   faddp st(2), st

Instead, rearrange the two loads so they appear as a pair: 


   fld   qword ptr [eax]
   fld   qword ptr [eax+8]
   fmul  qword ptr [ebx+8]
   faddp st(2), st
   fmul  qword ptr [ebx]
   faddp st(3), st
5.10 Placing Code and Data in the Same 64-Byte Cache Line

Optimization 

Avoid placing code and data together within a cache line, especially if the data becomes 
modified. 

Application 

This optimization applies to: 

• 32-bit software 

• 64-bit software 

Rationale 

Sharing code and data in the same 64-byte cache line may cause the L1 caches to thrash
(unnecessarily cast out code or data) in order to maintain coherency between the separate instruction 
and data caches. The AMD Athlon 64 and AMD Opteron processors have a cache-line size of 64 
bytes. 

For example, consider that a memory-indirect JMP instruction may have the data for the jump table 
residing in the same 64-byte cache line as the JMP instruction. This mixing of code and data in the 
same cache line results in lower performance. 

Do not place critical code at the border between 32-byte-aligned code segments and data segments. 
Code at the beginning or end of a data segment should be executed as infrequently as possible or 
padded. 

In summary, avoid self-modifying code and storing data in code segments. 
5.11 Sorting and Padding C and C++ Structures

Optimization 

Sort and pad C and C++ structures to achieve natural alignment. 

Application 

This optimization applies to: 

• 32-bit software 

• 64-bit software 

Rationale 

By sorting and padding structures at the source-code level, if the first member of a structure is 
naturally aligned, then all other members are naturally aligned as well. This allows, for example, 
arrays of structures to be perfectly aligned. 

Sorting and Padding C and C++ Structures 

To sort and pad a C or C++ structure, follow these steps: 

1. Sort the structure members according to their type sizes, declaring members with larger type sizes 
ahead of members with smaller type sizes. 

2. Pad the structure so the size of the structure is a multiple of the largest member’s type size. 

Example 

Consider the following structure declaration in a C function: 

struct { 

char a[5]; 
long k; 
double x; 

} baz ; 

Instead of allocating the members in the order in which they are declared, allocate them from lower to 
higher addresses in the following order and add padding: 

x, k, a[4], a[3], a[2], a[1], a[0], pad_byte6, ..., pad_byte0
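At the source level, the two steps might look like the following sketch. It uses int32_t in place of long so that the member is 4 bytes on every platform (an assumption made for portability of the example), and the struct and member names are illustrative only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Members sorted from largest to smallest type, then padded explicitly. */
struct baz_sorted {
    double  x;        /* 8 bytes: largest member first               */
    int32_t k;        /* 4 bytes: stands in for a 32-bit long        */
    char    a[5];     /* 5 bytes                                     */
    char    pad[7];   /* 8 + 4 + 5 + 7 = 24, a multiple of 8         */
};
```

Because the size is a multiple of the largest member's size, every element of an array of struct baz_sorted keeps x naturally aligned whenever the first element is.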
Related Information

For information on sorting and padding C and C++ structures at the C-source level, see “Sorting and 
Padding C and C++ Structures” on page 39. 
5.12 Sorting Local Variables

Optimization 

Sort local variables according to their type sizes, allocating those with larger type sizes ahead of those 
with smaller type sizes. 

Application 

This optimization applies to: 

• 32-bit software 

• 64-bit software 

Rationale 

If the first variable is allocated for natural alignment, all other variables are allocated contiguously in 
the order they are declared and are naturally aligned without any padding. 

Example 

Consider the following declarations in a C function: 

short ga, gu, gi; 
long foo, bar; 
double x, y, z[3];
char a, b; 
float baz; 

Instead of allocating the variables in the order in which they are declared, allocate them from lower to 
higher addresses in the following order: 

x, y, z[2], z[1], z[0], foo, bar, baz, ga, gu, gi, a, b

Related Information 

For information on sorting local variables at the C-source level, see “Sorting Local Variables” on 
page 41. 
5.13 Memory Copy

Optimization

For a very fast general-purpose memory copy routine, call the libc memcpy() function included
with the Microsoft or gcc tools. This function features optimizations for all block sizes and
alignments.

Application 

This optimization applies to: 

• 32-bit software 

• 64-bit software 

Rationale 

The memcpy() routines included with recent compilers from Microsoft and gcc feature optimizations
for all block sizes and alignments for AMD Athlon 64 and AMD Opteron processors.

Copying Small Data Structures 

Use inline assembly code to copy a small data structure in cache. Use an unrolled series of MOV 
instructions. Alternate loads and stores in sequences such as load/store/load/store routines, or use 
load/load/store/store sequences for even better performance. Align the destination (and source) if 
possible. 

Example 1 

The following 64-bit example copies 18 bytes of data: 

; rsi = source
; rdi = destination

   mov r8, [rsi]          ; 8 bytes of source
   mov r9, [rsi+8]        ; next 8 bytes of source
   mov [rdi], r8          ; write 8 bytes
   mov [rdi+8], r9        ; write next 8
   mov r8w, [rsi+16]      ; read two bytes "r8 word"
   mov [rdi+16], r8w      ; write the last 2 bytes

Example 2

The following example illustrates how to copy blocks of 32 bytes and larger, in cache. This code
performs best when the source and destination addresses are 8-byte aligned. Align the destination
before starting a copy, especially for large blocks. To write data directly to main memory, bypassing
the cache, use the MOVNTI instruction instead of MOV for the four store instructions.

; rsi = source 
; rdi = destination 
; ecx = byte count 

mov eax, ecx 

shr eax, 5 

jz done_32 

align 16 ; align the loop to a 16-byte fetch boundary 

copy_32_bytes: 

mov r8, [rsi] ; read 8 bytes 

mov r9, [rsi + 8] ; it's a bit faster to pair two reads

add rsi, 32 ; update source pointer 

mov [rdi], r8 ; store 8 bytes 

mov [rdi+8], r9 ; again, pair 2 stores for slight perf gain 

add rdi, 32 ; update destination pointer 

mov r8, [rsi-16] ; loop is unrolled 4 reads, 4 writes 

mov r9, [rsi-8] ; 4-way unroll hides latency of adds and dec 

dec eax ; decrement data counter (32 bytes) 

mov [rdi-16], r8 ; store more bytes 

mov [rdi-8], r9 ; store last 8 bytes 

jnz copy_32_bytes 

done_32: 

(copy any remaining bytes) 
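For readers working in C rather than assembly, the same load/load/store/store pattern can be sketched as follows. The function name is hypothetical, and fixed-size memcpy() calls stand in for the 8-byte MOV instructions (compilers typically lower them to single moves); whether the generated code matches the listing above depends on the compiler.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copies count bytes, 32 bytes per iteration, pairing loads and stores. */
void copy_blocks32(void *dst, const void *src, size_t count)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t blocks = count >> 5;              /* number of 32-byte blocks */

    while (blocks--) {
        uint64_t r0, r1;
        memcpy(&r0, s, 8);       memcpy(&r1, s + 8, 8);    /* two loads  */
        memcpy(d, &r0, 8);       memcpy(d + 8, &r1, 8);    /* two stores */
        memcpy(&r0, s + 16, 8);  memcpy(&r1, s + 24, 8);   /* two loads  */
        memcpy(d + 16, &r0, 8);  memcpy(d + 24, &r1, 8);   /* two stores */
        s += 32;
        d += 32;
    }
    memcpy(d, s, count & 31);                /* copy any remaining bytes */
}
```

For general-purpose copies the library memcpy() remains the recommended choice; a hand-written loop like this is only worth considering for a known, fixed access pattern.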
5.14 Stack Considerations

Make sure the stack is suitably aligned for the local variable with the largest base type. Then, using 
the technique described in “Sorting and Padding C and C++ Structures” on page 117, all variables can 
be properly aligned with no padding. 

Application 

This optimization applies to: 

• 32-bit software 

Extend Arguments to 32 Bits Before Pushing onto Stack 

Function arguments smaller than 32 bits should be extended to 32 bits before being pushed onto the 
stack, which ensures that the stack is always doubleword aligned on entry to a function. 

If a function has no local variables with a base type larger than a doubleword, no further work is 
necessary. If the function does have local variables whose base type is larger than a doubleword, 
insert additional code to ensure proper alignment of the stack. For example, the following code 
achieves quadword alignment: 

prologue:
   push ebp
   mov  ebp, esp
   sub  esp, SIZE_OF_LOCALS   ; Size of local variables
   and  esp, -8
   ...                        ; Push registers that need to be preserved.

epilogue:
   ...                        ; Pop registers that needed to be preserved.
   leave
   ret

With this technique, function arguments can be accessed through EBP, and local variables can be
accessed through ESP. Save and restore EBP between the prologue and the epilogue to keep it free for
general use.

Optimized Stack Usage

It is sometimes possible to improve performance in frequently executed routines by altering the way
variables and parameters are passed and accessed on the stack. Replacing PUSH and POP instructions
with MOV instructions can reduce stack-pointer dependencies and use fewer execution resources.
This optimization is usually most effective in smaller routines. Excessive use of this optimization can
result in increased code size, because MOV instructions are considerably larger than PUSH and POP
instructions.
5.15 Cache Issues when Writing Instruction Bytes to Memory

Optimization

When writing data consisting of instructions for future execution to memory, use streaming store
(write-combining) instructions such as MOVNTDQ and MOVNTI.

Application 

This optimization applies to: 

• 32-bit software 

• 64-bit software 

Rationale 

This optimization pertains to software that writes executable instructions to memory for subsequent 
execution, such as might be done by a just-in-time compiler. If normal store instructions are used to 
write the code to memory, then the cache lines will be in a modified state (either in L1 data cache or in
L2). When the processor eventually tries to execute the code, it will miss in the instruction cache.
Because the instruction cache cannot contain cache lines that are in a modified state, the data must be
flushed to memory before it can be fetched into the instruction cache. This unnecessarily evicts
possibly useful information from the caches. By using write-combining instructions, the contents of
the cache are preserved with no performance penalty, and this possibly provides a performance
improvement.
5.16 Interleave Loads and Stores

When loading and storing data as in a copy routine, the organization of the sequence of loads and 
stores can affect performance. 

Application 

This optimization applies to: 

• 32-bit software 

• 64-bit software 

Rationale 

When using SSE and SSE2 instructions to perform loads and stores, it is best to interleave them in the
following pattern: Load, Store, Load, Store, Load, Store, etc. This enables the processor to
maximize the load/store bandwidth.

If using MMX loads and stores in 32-bit mode, the loads and stores should be arranged in the
following pattern: Load, Load, Store, Store, Load, Load, Store, Store, etc.

Example 

The following example illustrates a sequence of 128-bit loads and stores: 


   movdqa  xmm0, [rdx+r8*8]       ; Load
   movntdq [rcx+r8*8], xmm0       ; Store
   movdqa  xmm1, [rdx+r8*8+16]    ; Load
   movntdq [rcx+r8*8+16], xmm1    ; Store
Chapter 6 Branch Optimizations


The optimizations in this chapter help improve branch prediction and minimize branch penalties. 

In This Chapter 

This chapter covers the following topics: 


Topic                                    Page
Density of Branches                      126
Two-Byte Near-Return RET Instruction     128
Branches That Depend on Random Data      130
Pairing CALL and RETURN                  132
Recursive Functions                      133
Nonzero Code-Segment Base Values         135
Replacing Branches with Computation      136
The LOOP Instruction                     141
Far Control-Transfer Instructions        142
6.1 Density of Branches

Optimization 

When possible, align branches such that they do not cross a 16-byte boundary. 

Application 

This optimization applies to: 

• 32-bit software 

• 64-bit software 

Rationale 

The AMD Athlon™ 64 and AMD Opteron™ processors have the capability to cache branch- 
prediction history for a maximum of three near branches (CALL, JMP, conditional branches, or 
returns) per 16-byte fetch window. A branch instruction that crosses a 16-byte boundary is counted in 
the second 16-byte window. Due to architectural restrictions, a branch that is split across a 16-byte 
boundary cannot dispatch with any other instructions when it is predicted taken. Perform this 
alignment by rearranging code; it is not beneficial to align branches using padding sequences. 

The following branches are limited to three per 16-byte window: 

jcc rel8
jcc rel32
jmp rel8
jmp rel32
jmp reg
jmp WORD PTR
jmp DWORD PTR
call rel16
call r/m16
call rel32
call r/m32

Coding more than three branches in the same 16-byte code window may lead to conflicts in the 
branch target buffer. To avoid conflicts in the branch target buffer, space out branches such that three 
or fewer exist in a given 16-byte code window. For absolute optimal performance, try to limit
branches to one per 16-byte code window. Avoid code sequences like the following: 

ALIGN 16
label3:
   call label1    ; 1st branch in 16-byte code window
   jc   label3    ; 2nd branch in 16-byte code window
   call label2    ; 3rd branch in 16-byte code window
   jnz  label4    ; 4th branch in 16-byte code window
                  ; Cannot be predicted.




If there is a jump table that contains many frequently executed branches, pad the table entries to 
8 bytes each to assure that there are never more than three branches per 16-byte block of code. 

Only branches that have been taken at least once are entered into the dynamic branch prediction, and 
therefore only those branches count toward the three-branch limit. 
6.2 Two-Byte Near-Return RET Instruction

Optimization 

Use of a two-byte near-return can improve performance. The single-byte near-return (opcode C3h) of 
the RET instruction should be used carefully. Specifically, avoid the following two situations: 

• Any kind of branch (either conditional or unconditional) that has the single-byte near-return RET 
instruction as its target. See “Examples.” 

• A conditional branch that occurs in the code directly before the single-byte near-return RET 
instruction. See “Examples.” 

Application 

This optimization applies to: 

• 32-bit software 

• 64-bit software 

Rationale 

The processor is unable to apply a branch prediction to the single-byte near-return form (opcode C3h)
of the RET instruction. 

The easiest way to assure the utilization of the branch prediction mechanism is to use a two-byte RET 
instruction. A two-byte RET has a REP instruction inserted before the RET, which produces the 
functional equivalent of the single-byte near-return RET instruction, but is not affected by the 
prediction limitations outlined above. To use a two-byte RET, define a text macro named repret and 
use it instead of the RET instruction to force the intended object code. 

REPRET TEXTEQU <DB 0F3h, 0C3h> 

Examples 

Avoid branches in which the target of the branch is a single-byte near-return:

   jmp label    ; Jump to a single-byte near-return RET instruction.
label:
   ret          ; RET is potentially mispredicted.

Avoid branches that immediately precede a single-byte near-return:

   jz  label    ; Conditional branch is not taken.
   ret          ; RET is a fall-through instruction,
                ; potentially mispredicted.
If possible, move an existing instruction, such as a POP instruction that is part of the function
epilogue, so that it is inserted between the branch and the RET instruction: 

jz label 

pop ebp ; Pad with at least one non-branch instruction, 
ret 

If no existing instruction is available for this purpose, then insert a NOP instruction to provide the 
necessary padding or, better still, use the recommended two-byte version of RET. 
6.3 Branches That Depend on Random Data

Optimization 

Avoid conditional branches that depend on random data, as these branches are difficult to predict. 

Application 

This optimization applies to: 

• 32-bit software 

• 64-bit software 

Rationale 

Suppose a piece of code receives a random stream of characters “A” through “Z” and branches if the 
character is before “M” in the collating sequence. Data-dependent branches acting upon basically 
random data cause the branch-prediction logic to mispredict the branch about 50% of the time. 

If possible, design branch-free alternative code sequences that result in shorter average execution 
time. This technique is especially important if the branch body is small. 

Examples 

The following examples illustrate this concept using the CMOVxx instruction.


Signed Integer ABS Function (x = labs(x))

   mov   ecx, [x]    ; Load value.
   mov   ebx, ecx    ; Save value.
   neg   ecx         ; Negate value.
   cmovs ecx, ebx    ; If negated value is negative, select value.
   mov   [x], ecx    ; Save labs result.


Unsigned Integer min Function (z = x < y ? x : y)

   mov    eax, [x]    ; Load x value.
   mov    ebx, [y]    ; Load y value.
   cmp    eax, ebx    ; EBX <= EAX ? CF = 0 : CF = 1
   cmovnc eax, ebx    ; EAX = (EBX <= EAX) ? EBX : EAX
   mov    [z], eax    ; Save min(X,Y).
Conditional Write

// C code:

int a, b, i, dummy, c[BUFSIZE];

if (a < b) {
   c[i++] = a;
}

; Assembly code:

   lea    esi, [dummy]      ; &dummy
   xor    ecx, ecx          ; i = 0

   lea    edi, [c+ecx*4]    ; &c[i]
   lea    edx, [ecx+1]      ; i++
   cmp    eax, ebx          ; a < b ?
   cmovge edi, esi          ; ptr = (a >= b) ? &dummy : &c[i]
   cmovl  ecx, edx          ; i = (a < b) ? i + 1 : i
   mov    [edi], eax        ; *ptr = a
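The three patterns above can also be written at the C level in a form that compilers can usually translate to conditional moves. This is a sketch, not a guarantee: the function names are hypothetical, and whether CMOVcc is actually emitted depends on the compiler and its options.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Unsigned minimum: a simple select with no side effects. */
uint32_t min_u32(uint32_t x, uint32_t y)
{
    return x < y ? x : y;
}

/* Signed absolute value. The caller must not pass INT32_MIN,
   because negating it overflows. */
int32_t abs_i32(int32_t x)
{
    int32_t neg = -x;
    return neg < 0 ? x : neg;    /* mirrors NEG followed by CMOVS */
}

/* Conditional write: select the destination address instead of
   branching around the store. Returns the updated index. */
size_t store_if_less(int *c, size_t i, int a, int b)
{
    int dummy;
    int *ptr = (a < b) ? &c[i] : &dummy;   /* the store always happens */
    *ptr = a;
    return i + (size_t)(a < b);
}
```

The conditional write trades a possibly mispredicted branch for an unconditional store to a scratch location, which only pays off when the condition is hard to predict.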
6.4 Pairing CALL and RETURN

Optimization

Always use care when pairing CALLs and RETURNs.

Application 

This optimization applies to: 

• 32-bit software 

• 64-bit software 

Rationale 

When the 12-entry return-address stack gets out of synchronization, the latency of returns increases. 
The return-address stack becomes unsynchronized when:

• Calls and returns do not match. 

• The depth of the return-address stack is exceeded because of too many levels of nested function 
calls. 
6.5 Recursive Functions

Optimization 

Use care when writing recursive functions. 

Application 

This optimization applies to: 

• 32-bit software 

• 64-bit software 

Rationale 

Returns are predicted as described in “Pairing CALL and RETURN,” so recursive functions should 
be written carefully. If there are only recursive function calls within the function as shown in the 
following example, the return address for each iteration of the recursive function is properly 
predicted. 

Preferred 

long fac(long a) 

{ 

if (a == 0) { 

return (1); 

} else { 

return (a * fac(a - 1)); 

} 

} 

If there are any other calls within the recursive function (except to itself) as shown in the next 
example, some returns can be mispredicted. If the number of recursive function calls plus the number 
of nonrecursive function calls within the recursive function is greater than 12, the return stack does 
not predict the correct return address for some of the returns once the recursion begins to unwind. 
Avoid

long fac(long a) 

{ 

if (a == 0) { 

return (1); 

} else { 

myp(a); // Can cause returns to be mispredicted 

return (a * fac(a - 1)); 

} 

} 

void myp(long a) 

{ 

printf("myp ");
return; 

} 

Because the function fac, in the preceding example, is end-recursive, it can be converted to iterative
code. A recursive function is classified as end-recursive when the function call to itself is at the end of 
the code. The following listing shows the rewritten code: 

Preferred 

long fac1(long a)

{

long t = 1;
while (a > 0) {
myp(a);
t *= a;
a--;

}

return (t);

} 
6.6 Nonzero Code-Segment Base Values

Optimization 

In 32-bit threads, avoid using a nonzero code-segment (CS) base value. (In 64-bit mode, segmentation 
is disabled and the segment base value is ignored and treated as zero.) 

Application 

This optimization applies to: 

• 32-bit software 

Rationale 

A nonzero CS base value causes an additional two cycles of branch-misprediction penalty when 
compared with a CS base value of zero: 


                  Minimum branch penalty (cycles)
CS base value     Prediction sequential    Prediction taken    Misprediction
0                 0                        1                   10
Not 0             0                        1                   12
6.7 Replacing Branches with Computation

Optimization

Use computation to simulate predicated execution or conditional moves.

Application 

This optimization applies to: 

• 32-bit software 

• 64-bit software 

Rationale 

Branches can negatively impact the performance of code. If the body of the branch is small, you can 
achieve higher performance by replacing the branch with computation. The computation simulates 
predicated execution or conditional moves. There are many SSE and SSE2 instructions that can be 
useful for accomplishing this. The principal instructions are as follows: ANDPS, ANDPD, ANDNPS, 
ANDNPD, CMPPS, CMPSS, CMPPD, CMPSD, MINPS, MINSS, MINPD, MINSD, MAXPS, 
MAXSS, MAXPD, MAXSD, ORPS, ORPD, PAND, PANDN, PCMPEQB, PCMPEQD, 
PCMPEQW, PCMPGTB, PCMPGTD, PCMPGTW, PMAXSW, PMAXUB, PMINSW, PMINUB, 
POR, PXOR, XORPS, and XORPD. 

For 32-bit code using 3DNow!™ instructions, try to avoid moving the MMX™ data to integer 
registers to perform comparisons and branches. Moving MMX data to integer registers requires either 
transport through memory or the use of MOVD reg, mmreg instructions, which are relatively 
inefficient. When using 3DNow! technology and MMX registers, the following instructions may be 
useful for eliminating branches: PCMPGTB, PCMPGTD, PCMPGTW, PFCMPGT, PFCMPGE, 
PFMIN, PFMAX, PAND, PANDN, POR, and PXOR. 

Muxing Constructs 

The most important construct to use in avoiding branches in SIMD code is a two-way muxing 
construct that is equivalent to the ternary operator (? : ) in C and C++. 
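As a scalar illustration of the muxing idea, the following C sketch builds an all-ones or all-zeros mask from a comparison and uses AND, AND-NOT, and OR to select between two values. The function names are hypothetical, and the code assumes 32-bit IEEE 754 floats.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Two-way mux on raw bits: returns a where mask bits are 1, b where they are 0. */
static uint32_t mux32(uint32_t mask, uint32_t a, uint32_t b)
{
    return (a & mask) | (b & ~mask);
}

/* r = (x < y) ? a : b, computed without a branch on the selected value. */
float select_lt(float x, float y, float a, float b)
{
    uint32_t mask = (uint32_t)0 - (uint32_t)(x < y);   /* 0xffffffff or 0 */
    uint32_t ua, ub, ur;
    float r;

    memcpy(&ua, &a, sizeof ua);     /* reinterpret the float bits */
    memcpy(&ub, &b, sizeof ub);
    ur = mux32(mask, ua, ub);
    memcpy(&r, &ur, sizeof r);
    return r;
}
```

In SIMD code the comparison instruction produces the mask directly for every element, so the same three logical operations select per lane.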
Examples

SSE Solution (Preferred)

; r = (x < y) ? a : b
;
; In:  XMM0 = a
;      XMM1 = b
;      XMM2 = x
;      XMM3 = y
; Out: XMM0 = r

   cmpps  xmm2, xmm3, 1    ; x < y ? 0xffffffff : 0
   andps  xmm0, xmm2       ; x < y ? a : 0
   andnps xmm2, xmm1       ; x < y ? 0 : b
   orps   xmm0, xmm2       ; x < y ? a : b


MMX™ Solution (Avoid)

; r = (x < y) ? a : b
;
; In:  MM0 = a
;      MM1 = b
;      MM2 = x
;      MM3 = y
; Out: MM0 = r

   pcmpgtd mm3, mm2    ; y > x ? 0xffffffff : 0
   movq    mm4, mm3    ; Duplicate mask.
   pandn   mm3, mm1    ; y > x ? 0 : b
   pand    mm0, mm4    ; y > x ? a : 0
   por     mm0, mm3    ; r = y > x ? a : b


Because the use of PANDN destroys the mask created by PCMPGTD, the mask needs to be saved, 
which requires an additional register. This adds an instruction, lengthens the dependency chain, and 
increases register pressure. Therefore, write two-way muxing constructs as follows: 


MMX™ Solution (Preferred)

; r = (x < y) ? a : b
;
; In:  MM0 = a
;      MM1 = b
;      MM2 = x
;      MM3 = y
; Out: MM0 = r

   pcmpgtd mm3, mm2    ; y > x ? 0xffffffff : 0
   pand    mm0, mm3    ; y > x ? a : 0
   pandn   mm3, mm1    ; y > x ? 0 : b
   por     mm0, mm3    ; r = y > x ? a : b
Sample Code Translated into AMD64 Code

The following examples use scalar code translated into AMD64 code. Note that it is not 
recommended that you use 3DNow! SIMD instructions for scalar code, because the advantage of 
3DNow! instructions lies in their “SIMDness.” These examples are meant to demonstrate general 
techniques for translating source code with branches into branchless 3DNow! code. Scalar source 
code was chosen to keep the examples simple. These techniques work identically for vector code. 

Each example shows the C code and the resulting 3DNow! code. 

Example 1: C Code 

float x, y, z; 
if (x < y) { 
z += 1.0; 

} else { 

z -= 1.0; 

} 

Example 1: 3DNow!™ Code 

; In:  MM0 = x
;      MM1 = y
;      MM2 = z
; Out: MM0 = z

   movq    mm3, mm0    ; Save x.
   movq    mm4, one    ; 1.0
   pfcmpge mm0, mm1    ; x < y ? 0 : 0xffffffff
   pslld   mm0, 31     ; x < y ? 0 : 0x80000000
   pxor    mm0, mm4    ; x < y ? 1.0 : -1.0
   pfadd   mm0, mm2    ; x < y ? z + 1.0 : z - 1.0
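The sign-bit trick used in this example can be checked with a scalar C sketch. It assumes 32-bit IEEE 754 floats, and the function name is hypothetical: the comparison result is shifted into bit 31 and XORed into the constant 1.0 to obtain either +1.0 or -1.0 without a branch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Returns z + 1.0 when x < y, otherwise z - 1.0. */
float step_z(float x, float y, float z)
{
    const float one = 1.0f;
    uint32_t ge = (uint32_t)(x >= y);   /* 1 when the "else" case applies */
    uint32_t bits;
    float delta;

    memcpy(&bits, &one, sizeof bits);
    bits ^= ge << 31;                   /* flip the sign bit: 1.0 becomes -1.0 */
    memcpy(&delta, &bits, sizeof delta);
    return z + delta;
}
```

Only the sign bit differs between the two constants, which is why a shift and an XOR are enough to choose between them.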


Example 2: C Code 

float x, z; 
z = abs(x); 
if (z >= 1) { 

z = 1 / z; 

} 

Example 2: 3DNow!™ Code 

; In: MM0 = x 
; Out: MM0 = z 

   movq     mm5, mabs    ; 0x7fffffff
   pand     mm0, mm5     ; z = abs(x)
   pfrcp    mm2, mm0     ; 1 / z approximation
   movq     mm1, mm0     ; Save z.
   pfrcpit1 mm0, mm2     ; 1 / z step
   pfrcpit2 mm0, mm2     ; 1 / z final
   pfmin    mm0, mm1     ; z = z < 1 ? z : 1 / z
Example 3: C Code

float x, z, r, res;
z = fabs(x);
if (z < 0.575) {
   res = r;
} else {
   res = PI / 2 - 2 * r;
}


Example 3: 3DNow!™ Code 


; In:  MM0 = x
;      MM1 = r
; Out: MM0 = res

   movq    mm7, mabs    ; Mask for absolute value
   pand    mm0, mm7     ; z = abs(x)
   movq    mm2, bnd     ; 0.575
   pcmpgtd mm2, mm0     ; z < 0.575 ? 0xffffffff : 0
   movq    mm3, pio2    ; pi / 2
   movq    mm0, mm1     ; Save r.
   pfadd   mm1, mm1     ; 2 * r
   pfsubr  mm1, mm3     ; pi / 2 - 2 * r
   pand    mm0, mm2     ; z < 0.575 ? r : 0
   pandn   mm2, mm1     ; z < 0.575 ? 0 : pi / 2 - 2 * r
   por     mm0, mm2     ; z < 0.575 ? r : pi / 2 - 2 * r


Example 4: C Code 

#define PI 3.14159265358979323 
float x, z, r, res; 

/* 0 <= r <= PI / 4 */ 
z = abs(x);
if (z < 1) { 

res = r; 

} else { 

res = PI / 2 - r; 

} 


Example 4: 3DNow!™ Code 


; In:  MM0 = x
;      MM1 = r
; Out: MM1 = res

   movq    mm5, mabs    ; Mask to clear sign bit
   movq    mm6, one     ; 1.0
   pand    mm0, mm5     ; z = abs(x)
   pcmpgtd mm6, mm0     ; z < 1 ? 0xffffffff : 0
   movq    mm4, pio2    ; pi / 2
   pfsub   mm4, mm1     ; pi / 2 - r
   pandn   mm6, mm4     ; z < 1 ? 0 : pi / 2 - r
   pfmax   mm1, mm6     ; res = z < 1 ? r : pi / 2 - r
Example 5: C Code

#define PI 3.14159265358979323
float x, y, xa, ya, r, res;
int xs, df;

xs = x < 0 ? 1 : 0;
xa = fabs(x);
ya = fabs(y);
df = (xa < ya);
if (xs && df) {
   res = PI / 2 + r;
} else if (xs) {
   res = PI - r;
} else if (df) {
   res = PI / 2 - r;
} else {
   res = r;
}


Example 5: 3DNow!™ Code 


; In:  MM0 = r
;      MM1 = y
;      MM2 = x
; Out: MM0 = res

   movq    mm7, sgn      ; Mask to extract sign bit
   movq    mm6, sgn      ; Mask to extract sign bit
   movq    mm5, mabs     ; Mask to clear sign bit
   pand    mm7, mm2      ; xs = sign(x)
   pand    mm1, mm5      ; ya = abs(y)
   pand    mm2, mm5      ; xa = abs(x)
   movq    mm6, mm1      ; y
   pcmpgtd mm6, mm2      ; df = (xa < ya) ? 0xffffffff : 0
   pslld   mm6, 31       ; df = bit 31
   movq    mm5, mm7      ; xs
   pxor    mm7, mm6      ; xs ^ df ? 0x80000000 : 0
   movq    mm3, npio2    ; -pi / 2
   pxor    mm5, mm3      ; xs ? pi / 2 : -pi / 2
   psrad   mm6, 31       ; df ? 0xffffffff : 0
   pandn   mm6, mm5      ; xs ? (df ? 0 : pi / 2) : (df ? 0 : -pi / 2)
   pfsub   mm6, mm3      ; pr = pi / 2 + (xs ? (df ? 0 : pi / 2) :
                         ;      (df ? 0 : -pi / 2))
   por     mm0, mm7      ; ar = xs ^ df ? -r : r
   pfadd   mm0, mm6      ; res = ar + pr
6.8 The LOOP Instruction

Optimization 

Avoid using the LOOP instruction. 

Application 

This optimization applies to: 

• 32-bit software 

• 64-bit software 

Rationale 

The LOOP instruction has a latency of at least 8 cycles. 

Example 

Avoid code like this, which uses the LOOP instruction: 

label:
    loop label

Instead, replace the LOOP instruction with a DEC and a JNZ:

label:
    dec rcx
    jnz label
6.9 Far Control-Transfer Instructions

Optimization 

Use far control-transfer instructions only when necessary. (Far control-transfer instructions include 
the far forms of JMP, CALL, and RET, as well as the INT, INTO, and IRET instructions.) 

Application 

This optimization applies to: 

• 32-bit software 

• 64-bit software 

Rationale 

The processor’s branch-prediction unit, which is used for both conditional and unconditional 
branches, does not predict far branches. 
Chapter 7 Scheduling Optimizations

The optimizations discussed in this chapter help improve scheduling in the processor.
This chapter covers the following topics:

Topic                                          Page
Instruction Scheduling by Latency              144
Loop Unrolling                                 145
Inline Functions                               149
Address-Generation Interlocks                  151
MOVZX and MOVSX                                153
Pointer Arithmetic in Loops                    154
Pushing Memory Data Directly onto the Stack    157
7.1 Instruction Scheduling by Latency

Optimization 

In general, select instructions with shorter latencies that are DirectPath—not VectorPath— 
instructions. For a list of instruction latencies and classifications, see Appendix C, “Instruction 
Latencies.” 

The AMD Athlon™ 64 and AMD Opteron™ processors can execute up to three AMD64 instructions 
per cycle, with each instruction possibly having a different latency. The AMD Athlon 64 and 
AMD Opteron processors have flexible scheduling, but for absolute maximum performance, schedule 
instructions according to their latencies and data dependencies. The goal is to reduce the overall 
length of dependency chains. 

Application 

This optimization applies to: 

• 32-bit software 

• 64-bit software 
7.2 Loop Unrolling

Optimization

Use loop unrolling where appropriate to increase instruction-level parallelism:

If all of these conditions are true:

• The loop is in a frequently executed piece of code.

• The number of loop iterations is known at compile time.

• The loop body includes fewer than 10 instructions.

Then use: Complete loop unrolling

If all of these conditions are true:

• Spare registers are available (for example, when operating in 64-bit mode,
where additional registers are available).

• The loop body is small, so that loop overhead is significant.

• The number of loop iterations is likely greater than 10.

Then use: Partial loop unrolling


Application 

This optimization applies to: 

• 32-bit software 

• 64-bit software 

Loop Unrolling 

Loop unrolling is a technique that duplicates the body of a loop one or more times in order to increase 
the number of instructions relative to the branch and allow operations from different loop iterations to 
execute in parallel. 

There are two types of loop unrolling: 

• Complete loop unrolling 

• Partial loop unrolling 
Complete Loop Unrolling

Complete loop unrolling eliminates the loop overhead completely by replacing the loop with copies of 
the loop body. 

Because complete loop unrolling removes the loop counter, it also reduces register pressure. 
However, completely unrolling very large loops can result in the inefficient use of the L1 instruction
cache.

Example: Complete Loop Unrolling 

In the following C code, the number of loop iterations is known at compile time and the loop body is 
less than 100 instructions: 

#define ARRAY_LENGTH 3 

int sum, i, a[ARRAY_LENGTH]; 

sum = 0; 

for (i = 0; i < ARRAY_LENGTH; i++) {
    sum = sum + a[i];
}

To completely unroll an n-iteration loop, remove the loop control and replicate the loop body n times:

sum = 0;
sum = sum + a[0];
sum = sum + a[1];
sum = sum + a[2];

Partial Loop Unrolling 

Partial loop unrolling reduces the loop overhead by duplicating the loop body several times, changing 
the increment in the loop, and adding cleanup code to execute any leftover iterations of the loop. The 
number of times the loop body is duplicated is known as the unroll factor. 

However, partial loop unrolling may increase register pressure. 

Example: Partial Loop Unrolling 

In the following C code, each element of one array is added to the corresponding element of another 
array: 

double a[MAX_LENGTH], b[MAX_LENGTH]; 

for (i = 0; i < MAX_LENGTH; i++) {
    a[i] = a[i] + b[i];
}
Without loop unrolling, this is the equivalent assembly-language code:

mov  ecx, MAX_LENGTH    ; Initialize counter.
mov  eax, OFFSET a      ; Load address of array a into EAX.
mov  ebx, OFFSET b      ; Load address of array b into EBX.

add_loop:
fld  QWORD PTR [eax]    ; Push object pointed to by EAX onto the FP stack.
fadd QWORD PTR [ebx]    ; Add object pointed to by EBX to ST(0).
fstp QWORD PTR [eax]    ; Copy ST(0) to object pointed to by EAX; pop ST(0).
add  eax, 8             ; Point to next element of array a.
add  ebx, 8             ; Point to next element of array b.
dec  ecx                ; Decrement counter.
jnz  add_loop           ; If elements remain, then jump.

The rolled loop consists of seven instructions. AMD Athlon 64 and AMD Opteron processors can
decode and retire as many as three instructions per cycle, so the loop cannot execute faster than three
iterations in seven cycles (3/7 of a floating-point add per cycle). However, the pipelined floating-point
adder allows one add every cycle.

(3 instructions / cycle) × (1 iteration / 7 instructions) × (1 FADD / iteration) = 3 FADDs / 7 cycles = 0.429 FADDs/cycle


After partial loop unrolling using an unroll factor of two, the new code creates a potential end case 
that must be handled outside the loop: 


mov  ecx, MAX_LENGTH      ; Initialize counter.
mov  eax, OFFSET a        ; Load address of array a into EAX.
mov  ebx, OFFSET b        ; Load address of array b into EBX.
shr  ecx, 1               ; Divide counter by 2 (the unroll factor).
jnc  add_loop             ; If original counter was even, then jump.

; Handle the end case.
fld  QWORD PTR [eax]      ; Push object pointed to by EAX onto the FP stack.
fadd QWORD PTR [ebx]      ; Add object pointed to by EBX to ST(0).
fstp QWORD PTR [eax]      ; Copy ST(0) to object pointed to by EAX; pop ST(0).
add  eax, 8               ; Point to next element of array a.
add  ebx, 8               ; Point to next element of array b.

add_loop:
fld  QWORD PTR [eax]      ; Push object pointed to by EAX onto the FP stack.
fadd QWORD PTR [ebx]      ; Add object pointed to by EBX to ST(0).
fstp QWORD PTR [eax]      ; Copy ST(0) to object pointed to by EAX; pop ST(0).
fld  QWORD PTR [eax+8]    ; Repeat for next element.
fadd QWORD PTR [ebx+8]
fstp QWORD PTR [eax+8]
add  eax, 16              ; Point to next element of array a.
add  ebx, 16              ; Point to next element of array b.
dec  ecx                  ; Decrement counter.
jnz  add_loop             ; If elements remain, then jump.
The unrolled loop consists of 10 instructions. Based on the decode/retire bandwidth of three
instructions per cycle, this loop goes no faster than three iterations in 10 cycles (which is equivalent to
6/10 of a floating-point add per cycle because there are two additions per iteration), or 1.4 times as
fast as the original loop.

(3 instructions / cycle) × (1 iteration / 10 instructions) × (2 FADDs / iteration) = 6 FADDs / 10 cycles = 0.600 FADDs/cycle

Deriving the Loop Control for Partially Unrolled Loops 

A frequently used loop construct is a counting loop. In a typical case, the loop count starts at some 
lower bound (low), increases by some fixed, positive increment (inc) for each iteration of the loop, 
and may not exceed some upper bound (high): 

for (k = low; k <= high; k += inc) {
    x[k] = ...
}

The following code shows how to partially unroll such a loop by an unroll factor (factor) and how to
derive the loop control for the partially unrolled version of the loop:

for (k = low; k <= (high - (factor - 1) * inc); k += factor * inc) {
    // Begin the series of unrolled statements.
    x[k + 0 * inc] = ...

    // Continue the series if the unrolling factor is greater than 2.
    x[k + 1 * inc] = ...
    x[k + 2 * inc] = ...

    // End the series.
    x[k + (factor - 1) * inc] = ...
}

// Handle the end cases.
for (k = k; k <= high; k += inc) {
    x[k] = ...
}
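
The loop-control derivation above can be made concrete with a small C sketch. It applies an unroll factor of four to the array-addition example from this section, followed by a cleanup loop for the leftover elements. The function name add_arrays_unrolled and the choice of size_t for the count are illustrative assumptions, not part of this guide.

```c
#include <stddef.h>

/* Hypothetical helper: a[i] += b[i] for n elements, unrolled by 4. */
void add_arrays_unrolled(double *a, const double *b, size_t n)
{
    size_t i = 0;

    /* Main loop: four elements per iteration. The bound is written as
       i + 3 < n so that it stays correct when n is smaller than 4. */
    for (; i + 3 < n; i += 4) {
        a[i + 0] += b[i + 0];
        a[i + 1] += b[i + 1];
        a[i + 2] += b[i + 2];
        a[i + 3] += b[i + 3];
    }

    /* Cleanup loop: handles the 0 to 3 leftover elements. */
    for (; i < n; i++) {
        a[i] += b[i];
    }
}
```

Whether a factor of four pays off depends on register pressure and on the loop body; profiling remains the best guide.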

Related Information 

For information on loop unrolling at the C-source level, see “Unrolling Small Loops” on page 13. 
7.3 Inline Functions

Optimization 

Use function inlining when: 

• A function is called from just one site in the code. (For the C language, determination of this 
characteristic is made easier if functions are explicitly declared static unless they require 
external linkage.) 

• A function—once inlined—contains fewer than 25 machine instructions. 

Application 

This optimization applies to: 

• 32-bit software 

• 64-bit software 

Rationale 

There are advantages and disadvantages to function inlining. On the one hand, function inlining 
eliminates function-call overhead and allows better register allocation and instruction scheduling at 
the site of the function call. The disadvantage of function inlining is decreased code reference locality, 
which can increase execution time due to instruction cache misses. 

For functions that create fewer than 25 machine instructions once inlined, it is likely that the function- 
call overhead is close to, or more than, the time spent executing the function body. In these cases, 
function inlining is recommended. 

Function-call overhead on the AMD Athlon 64 and AMD Opteron processors can be low because 
calls and returns are executed very quickly due to the use of prediction mechanisms. However, there is 
still overhead due to passing function arguments through memory, which creates store-to-load- 
forwarding dependencies. (In 64-bit mode, this overhead is typically avoided by passing more 
arguments in registers, as specified in the AMD64 Application Binary Interface [ABI] for the 
operating system.) 

For longer functions, the benefits of reduced function-call overhead give diminishing returns. A 
function that results in the insertion of more than 500 machine instructions at the call site should 
probably not be inlined. Some larger functions might consist of multiple, relatively short paths that 
are negatively affected by function overhead. In such a case, it can be advantageous to inline larger 
functions. Profiling information is the best guide in determining whether to inline such large 
functions. 
Additional Recommendations

In general, function inlining works best if the compiler utilizes feedback from a profiler to identify the 
function calls most frequently executed. If such data is not available, a reasonable approach is to 
concentrate on function calls inside loops. Do not consider as candidates for inlining any functions 
that are directly recursive. However, if they are end-recursive, the compiler should convert them to an 
iterative equivalent to avoid potential overflow of the processor’s return-prediction mechanism (return 
stack) during deep recursion. For best results, a compiler should support function inlining across 
multiple source files. In addition, a compiler should provide intrinsic functions for commonly used 
library routines, such as sin, strcmp, or memcpy. 
7.4 Address-Generation Interlocks

Optimization 

Avoid address-generation interlocks by scheduling loads and stores whose addresses can be 
calculated quickly ahead of loads and stores that require the resolution of a long dependency chain in 
order to generate their addresses. 

Application 

This optimization applies to: 

• 32-bit software 

• 64-bit software 

Address-Generation Interlocks 

An address-generation interlock is a condition in which newer loads and stores whose addresses have 
already been calculated by the processor are blocked by older loads and stores whose addresses have 
not yet been calculated. 

Rationale 

The processor schedules instructions that access the data cache (loads and stores) in program order. 
By carefully choosing the order of loads and stores, you can avoid address-generation interlocks. 
Example

Avoid code that places a load whose address takes longer to calculate before a load whose address can
be determined more quickly:

add ebx, ecx                    ; Instruction 1
mov eax, DWORD PTR [10h]        ; Instruction 2 (fast address calc.)
mov ecx, DWORD PTR [eax+ebx]    ; Instruction 3 (slow address calc.)
mov edx, DWORD PTR [24h]        ; This load is stalled from accessing the
                                ; data cache due to the long latency
                                ; caused by generating the address for
                                ; instruction 3.

Where possible, reorder instructions so that loads with simpler address calculations come before
those with more complex address calculations:

add ebx, ecx                    ; Instruction 1
mov eax, DWORD PTR [10h]        ; Instruction 2
mov edx, DWORD PTR [24h]        ; Place load above instruction 3 to avoid
                                ; address-generation interlock stall.
mov ecx, DWORD PTR [eax+ebx]    ; Instruction 3
7.5 MOVZX and MOVSX

Optimization 

Use the MOVZX and MOVSX instructions to zero-extend or sign-extend, respectively, an operand to 
a larger size. 

Application 

This optimization applies to: 

• 32-bit software 

• 64-bit software 

Rationale 

Typical code for zero extension that replaces MOVZX uses more decode and execution resources than 
MOVZX. It also has higher latency due to the superset dependency between the XOR and the MOV, 
which requires a merge operation. 

Example 

When zero-extending an operand (in this case, a byte), avoid code such as the following: 


xor rax, rax 
mov al, mem 

Instead, use the MOVZX instruction: 

movzx rax, BYTE PTR mem 
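
At the C level, the same extensions are usually expressed as plain integer conversions. The sketch below is illustrative only: the function names are hypothetical, and the statement that a compiler emits MOVZX or MOVSX for these casts is a common observation rather than a guarantee.

```c
#include <stdint.h>

/* Zero-extend a byte to 64 bits; compilers typically emit MOVZX here. */
uint64_t zero_extend_byte(uint8_t b)
{
    return (uint64_t)b;
}

/* Sign-extend a byte to 64 bits; compilers typically emit MOVSX here. */
int64_t sign_extend_byte(int8_t b)
{
    return (int64_t)b;
}
```

Writing the conversion as a cast, rather than clearing a wide variable and then assigning its low byte, gives the compiler the best chance of selecting the single-instruction form.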
7.6 Pointer Arithmetic in Loops

Optimization 

Minimize pointer arithmetic in loops, especially if the loop bodies are small. Take advantage of 
scaled-index addressing modes to utilize the loop counter as an index into memory arrays. 

Application 

This optimization applies to: 

• 32-bit software 

• 64-bit software 

Rationale 

In small loops, pointer arithmetic causes significant overhead. Using scaled-index addressing modes 
has no negative impact on execution speed, but the reduced number of instructions preserves decode 
bandwidth. 

Example 

Consider the following C code, which adds the elements of two arrays and stores them in a third 
array: 

int a[MAXSIZE], b[MAXSIZE], c[MAXSIZE], i;

for (i = 0; i < MAXSIZE; i++) {
    c[i] = a[i] + b[i];
}
Avoid an assembly-language equivalent like this, which uses base and displacement components (for
example, [esi+a]) to compute array-element addresses, requiring additional pointer arithmetic to
increment the offsets into the forward-traversed arrays:

mov ecx, MAXSIZE      ; Initialize loop counter.
xor esi, esi          ; Initialize offset into array a.
xor edi, edi          ; Initialize offset into array b.
xor ebx, ebx          ; Initialize offset into array c.

add_loop:
mov eax, [esi+a]      ; Get element from a.
mov edx, [edi+b]      ; Get element from b.
add eax, edx          ; a[i] + b[i]
mov [ebx+c], eax      ; Write result to c.
add esi, 4            ; Increment offset into a.
add edi, 4            ; Increment offset into b.
add ebx, 4            ; Increment offset into c.
dec ecx               ; Decrement loop count
jnz add_loop          ; until loop count is 0.


Instead, traverse the arrays in a downward direction (from higher to lower addresses), in order to take
advantage of scaled-index addressing (for example, [ecx*4+a]), which minimizes pointer arithmetic
within the loop:

mov ecx, MAXSIZE - 1      ; Initialize index.

add_loop:
mov eax, [ecx*4+a]        ; Get element from a.
mov edx, [ecx*4+b]        ; Get element from b.
add eax, edx              ; a[i] + b[i]
mov [ecx*4+c], eax        ; Write result to c.
dec ecx                   ; Decrement index
jns add_loop              ; until index is negative.


A change in the direction of traversal is possible only if each loop iteration is completely independent 
of the others. If you cannot change the direction of traversal for a given array, it is still possible to 
minimize pointer arithmetic by using as a base address a displacement that points to the byte past the 
end of the array, and using an index that starts with a negative value and reaches zero when the loop 
expires: 


mov ecx, (-MAXSIZE)               ; Initialize index.

add_loop:
mov eax, [ecx*4+a+MAXSIZE*4]      ; Get element from a.
mov edx, [ecx*4+b+MAXSIZE*4]      ; Get element from b.
add eax, edx                      ; a[i] + b[i]
mov [ecx*4+c+MAXSIZE*4], eax      ; Write result to c.
inc ecx                           ; Increment index
jnz add_loop                      ; until index is 0.
If the base addresses of the arrays are held in registers (for example, when the base addresses are
passed as the arguments of a function), biasing the base addresses requires additional instructions to 
perform the biasing at run time, and a small amount of additional overhead is incurred. 
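
The negative-index technique can also be sketched in C for the case where the base addresses arrive as function arguments. The function name add_arrays_biased and its parameters are hypothetical; whether a given compiler maps this loop onto scaled-index addressing is an assumption that should be checked in the generated code.

```c
#include <stddef.h>

/* Hypothetical helper: c[i] = a[i] + b[i] for n elements. */
void add_arrays_biased(const int *a, const int *b, int *c, size_t n)
{
    /* Bias each base pointer to one element past the end of its array.
       This is the run-time biasing step described above. */
    const int *a_end = a + n;
    const int *b_end = b + n;
    int *c_end = c + n;

    /* The index runs from -n up to 0, so the arrays are still traversed
       in the forward direction and the only loop-carried update is the
       index increment. */
    for (ptrdiff_t i = -(ptrdiff_t)n; i != 0; i++) {
        c_end[i] = a_end[i] + b_end[i];
    }
}
```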


25112 Rev. 3.06 September 2005

Software Optimization Guide for AMD64 Processors


7.7 Pushing Memory Data Directly onto the Stack 

Optimization 

Push memory data directly onto the stack instead of loading it into a register first. 

Application 

This optimization applies to: 

• 32-bit software 

• 64-bit software 

Rationale 

Pushing memory data directly onto the stack reduces register pressure and eliminates data 
dependencies. 

Example 

Avoid code that first loads the memory data into a register and then pushes it onto the stack: 

mov rax, mem 
push rax 

Instead, push the memory data directly onto the stack: 
push mem 
Chapter 8 Integer Optimizations

The optimizations in this chapter help improve integer performance.
This chapter covers the following topics:

Topic                                                                                     Page
Replacing Division with Multiplication                                                    160
Alternative Code for Multiplying by a Constant                                            164
Repeated String Instructions                                                              167
Using XOR to Clear Integer Registers                                                      169
Efficient 64-Bit Integer Arithmetic in 32-Bit Mode                                        170
Efficient Implementation of Population-Count Function in 32-Bit Mode                      179
Efficient Binary-to-ASCII Decimal Conversion                                              181
Derivation of Algorithm, Multiplier, and Shift Factor for Integer Division by Constants   186
Optimizing Integer Division                                                               192
8.1 Replacing Division with Multiplication

Optimization 

Replace integer division by constants with multiplication by the reciprocal. 

Rationale 

Because the AMD Athlon™ 64 and AMD Opteron™ processors have very fast integer multiplication 
(3-8 cycles signed, 3-8 cycles unsigned) and the integer division delivers only one bit of quotient per 
cycle (22-47 cycles signed, 17-41 cycles unsigned), the equivalent code is much faster. Either follow 
the examples in this chapter that illustrate the use of integer division by constants or create the 
executables using the code in “Derivation of Algorithm, Multiplier, and Shift Factor for Integer 
Division by Constants” on page 186. 

Multiplication by Reciprocal (Division) Utility 

The code for the utilities is shown in “Derivation of Algorithm, Multiplier, and Shift Factor for 
Integer Division by Constants” on page 186. The utilities provided in this document are for reference 
only and are not supported by AMD. 

Signed Division Utility 

The sdiv.exe utility finds the fastest code for signed division by a constant. The utility displays the 
code after the user enters a signed constant divisor. To redirect the code to a file, type the following 
command: 

sdiv > example.out 

Unsigned Division Utility 

The udiv.exe utility finds the fastest code for unsigned division by a constant. The utility displays
the code after the user enters an unsigned constant divisor. To redirect the code to a file, type the 
following command: 

udiv > example.out 
Unsigned Division by Multiplication of Constant

Algorithm: Divisors 1 <= d < 2^31, Odd d

The following code shows an unsigned division using a constant value multiplier. 

; a = algorithm 
; m = multiplier 
; s = shift factor 

; a == 0 
mov eax, m 
mul dividend 

shr edx, s ; EDX = quotient 

; a == 1
mov eax, m 
mul dividend 
add eax, m 
adc edx, 0 

shr edx, s ; EDX = quotient 

Code for determining the algorithm (a), multiplier (m), and shift factor (s) from the divisor (d) is
found in the section “Derivation of Algorithm, Multiplier, and Shift Factor for Integer Division by 
Constants” on page 186. 
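
As an illustration of the a == 0 form, the following C sketch divides an unsigned 32-bit value by 3. The multiplier 0xAAAAAAAB and shift factor 1 are widely used constants for this divisor and are assumed here for the example; they are not taken from the output of the utilities described above, and the function name udiv3 is hypothetical.

```c
#include <stdint.h>

/* Unsigned n / 3 using the a == 0 form: multiply by m, keep the upper
   32 bits of the 64-bit product (EDX after MUL), then shift right by s.
   m = 0xAAAAAAAB and s = 1 are assumed constants for d = 3. */
uint32_t udiv3(uint32_t n)
{
    const uint32_t m = 0xAAAAAAABu;
    const unsigned s = 1;
    uint32_t high = (uint32_t)(((uint64_t)n * m) >> 32);
    return high >> s;
}
```

Most optimizing compilers perform this transformation automatically for constant divisors, so hand-coding it is mainly of interest in assembly language.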

Algorithm: Divisors 2^31 <= d < 2^32

For divisors 2^31 <= d < 2^32, the possible quotient values are either 0 or 1. For this reason, it is easy to
establish the quotient by simple comparison of the dividend and divisor. When the dividend needs to
be preserved, consider using code like the following:

; In: EAX = dividend 

; Out: EDX = quotient 

xor edx, edx ; 0 

cmp eax, d ; CF = (dividend < divisor) ? 1 : 0 

sbb edx, -1 ; quotient = 0 + 1 - CF = (dividend < divisor) ? 0 : 1 

When the dividend does not need to be preserved, the division can be accomplished without the use of 
an additional register, thus reducing register pressure, as shown here: 

; In: EDX = dividend
; Out: EAX = quotient

cmp edx, d ; CF = (dividend < divisor) ? 1 : 0
mov eax, 0 ; 0
sbb eax, -1 ; quotient = 0 + 1 - CF = (dividend < divisor) ? 0 : 1
Simpler Code for Restricted Dividend

Integer division by a constant can be made faster if the range of the dividend is limited, which 
removes a shift associated with most divisors. For example, for a divide by 10 operation, use the 
following code if the dividend is less than 4000_0005h: 

mov eax, dividend 
mov edx, 01999999Ah 
mul edx 

mov quotient, edx 
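
The same sequence can be written in C as a quick check of the restricted-range claim. This is a minimal sketch, and the function name udiv10_restricted is hypothetical. The result is only meaningful for dividends below 4000_0005h, as stated above.

```c
#include <stdint.h>

/* n / 10, valid only for n < 0x40000005: the quotient is the upper
   32 bits of n * 0x1999999A, with no shift afterwards. */
uint32_t udiv10_restricted(uint32_t n)
{
    return (uint32_t)(((uint64_t)n * 0x1999999Au) >> 32);
}
```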

Signed Division by Multiplication of Constant 

Algorithm: Divisors 2 <= d < 2^31

These algorithms work if the divisor is positive. If the divisor is negative, use abs(d) instead of d, and
append a neg edx instruction to the code. These changes make use of the fact that n/-d = -(n/d).

; a = algorithm
; m = multiplier
; s = shift count

; a == 0
mov eax, m
imul dividend
mov eax, dividend
shr eax, 31
sar edx, s
add edx, eax ; Quotient in EDX

; a == 1
mov eax, m
imul dividend
mov eax, dividend
add edx, eax
shr eax, 31
sar edx, s
add edx, eax ; Quotient in EDX

Code for determining the algorithm (a), multiplier (m), and shift factor (s) is shown in “Derivation of
Algorithm, Multiplier, and Shift Factor for Integer Division by Constants” on page 186.
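
A C rendering of the a == 0 form for the divisor 3 may help in following the instruction sequence. The multiplier 0x55555556 and shift count 0 are well-known constants for this divisor and are assumed for the example; the function name sdiv3 is hypothetical. The sketch also assumes that right-shifting a negative signed value is an arithmetic shift, which holds on mainstream compilers but is implementation-defined in C.

```c
#include <stdint.h>

/* Signed n / 3, truncating toward zero, using the a == 0 form. */
int32_t sdiv3(int32_t n)
{
    const int64_t m = 0x55555556;   /* Assumed multiplier for d = 3. */
    const int s = 0;                /* Assumed shift count for d = 3. */

    /* Upper 32 bits of the signed 64-bit product (EDX after IMUL).
       Assumes >> on a negative int64_t is an arithmetic shift. */
    int32_t high = (int32_t)((n * m) >> 32);

    /* 1 if the dividend is negative, otherwise 0 (SHR EAX, 31). */
    int32_t sign = (int32_t)((uint32_t)n >> 31);

    /* SAR EDX, s followed by ADD EDX, EAX. */
    return (high >> s) + sign;
}
```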

Signed Division by 2 

; In: EAX = dividend 
; Out: EAX = quotient 

cmp eax, 80000000h ; CF = 1 if dividend >= 0. 

sbb eax, -1 ; Increment dividend if it is < 0. 

sar eax, 1 ; Perform right shift. 
Signed Division by 2^n

; In: EAX = dividend
; Out: EAX = quotient

cdq                   ; Sign extend into EDX.
and edx, (2^n - 1)    ; Mask correction (use divisor - 1)
add eax, edx          ; Apply correction if necessary.
sar eax, (n)          ; Perform right shift by log2(divisor).

Signed Division by -2

; In: EAX = dividend
; Out: EAX = quotient

cmp eax, 80000000h    ; CF = 1 if dividend >= 0.
sbb eax, -1           ; Increment dividend if it is < 0.
sar eax, 1            ; Perform right shift.
neg eax               ; Use (x / -2) == -(x / 2).

Signed Division by -(2^n)

; In: EAX = dividend
; Out: EAX = quotient

cdq                   ; Sign extend into EDX.
and edx, (2^n - 1)    ; Mask correction (-divisor - 1).
add eax, edx          ; Apply correction if necessary.
sar eax, (n)          ; Right shift by log2(-divisor).
neg eax               ; Use (x / -(2^n)) == (-(x / 2^n)).

Remainder of Signed Division by 2 or -2

; In: EAX = dividend
; Out: EAX = remainder

cdq                   ; Sign extend into EDX.
and eax, 1            ; Compute remainder.
xor eax, edx          ; Negate remainder if
sub eax, edx          ; dividend was < 0.

Remainder of Signed Division by 2^n or -(2^n)

; In: EAX = dividend
; Out: EAX = remainder

cdq                   ; Sign extend into EDX.
and edx, (2^n - 1)    ; Mask correction (abs(divisor) - 1)
add eax, edx          ; Apply pre-correction.
and eax, (2^n - 1)    ; Mask out remainder (abs(divisor) - 1)
sub eax, edx          ; Apply pre-correction if necessary.
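
The power-of-two sequences above translate almost line for line into C. The following sketch mirrors the CDQ, AND, ADD, SAR quotient sequence and the corresponding remainder sequence. The function names and the parameter k (the exponent n in the listings, restricted here to 1 through 30) are assumptions made for the example, as is the reliance on arithmetic right shift of negative values.

```c
#include <stdint.h>

/* n / 2^k, truncating toward zero, for 1 <= k <= 30. */
int32_t sdiv_pow2(int32_t n, unsigned k)
{
    int32_t mask = (int32_t)((1u << k) - 1);   /* divisor - 1 */
    int32_t sign = n < 0 ? -1 : 0;             /* CDQ: all ones if n < 0 */
    int32_t correction = sign & mask;          /* AND EDX, (2^n - 1) */

    /* ADD then SAR. Assumes >> on a negative int32_t is arithmetic. */
    return (n + correction) >> k;
}

/* n % 2^k with the sign of the dividend, for 1 <= k <= 30. */
int32_t srem_pow2(int32_t n, unsigned k)
{
    int32_t mask = (int32_t)((1u << k) - 1);   /* abs(divisor) - 1 */
    int32_t correction = (n < 0 ? -1 : 0) & mask;

    /* Pre-correct, mask out the remainder, then undo the correction. */
    return ((n + correction) & mask) - correction;
}
```

Without the correction term, an arithmetic right shift alone rounds negative dividends toward negative infinity, which differs from the truncating behavior of IDIV.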
8.2 Alternative Code for Multiplying by a Constant

Optimization 

Devise instruction sequences with lower latency to accomplish multiplication by certain constant 
multipliers. 

Rationale 

A 32-bit integer multiplied by a constant has a latency of 3 cycles; a 64-bit integer multiplied by a 
constant has a latency of 4 cycles. For certain constant multipliers, instruction sequences can be 
devised that accomplish the multiplication with lower latency. Because the AMD Athlon 64 and 
AMD Opteron processors contain only one integer multiplier but three integer execution units, the 
replacement code can provide better throughput as well. 

Most replacement sequences require the use of an additional temporary register, thus increasing 
register pressure. If register pressure in a piece of code that performs integer multiplication with a 
constant is already high, it could be better for the overall performance of that code to use the IMUL 
instruction instead of the replacement code. Similarly, replacement sequences with low latency but 
containing many instructions may negatively influence decode bandwidth as compared to the IMUL 
instruction. In general, replacement sequences containing more than four instructions are not 
recommended. 

The following code samples are designed for the original source to receive the final result. Other 
sequences are possible if the result is in a different register. Sequences that do not require a temporary 
register are favored over ones requiring a temporary register, even if the latency is higher. Arithmetic- 
logic-unit operations are preferred over shifts to keep code size small. Similarly, both arithmetic- 
logic-unit operations and shifts are favored over the LEA instruction. 

There are improvements in the AMD Athlon 64 and AMD Opteron processors’ multiplier over that of 
previous x86 processors. For this reason, when doing 32-bit multiplication, only use the alternative 
sequence if the alternative sequence has a latency that is less than or equal to 2 cycles. For 64-bit 
multiplication, only use the alternative sequence if the alternative sequence has a latency that is less 
than or equal to 3 cycles. 

Examples

by 2:   add reg1, reg1              ; 1 cycle

by 3:   lea reg1, [reg1+reg1*2]     ; 2 cycles

by 4:   shl reg1, 2                 ; 1 cycle

by 5:   lea reg1, [reg1+reg1*4]     ; 2 cycles

by 6:   lea reg1, [reg1+reg1*2]     ; 3 cycles
        add reg1, reg1

by 7:   mov reg2, reg1              ; 2 cycles
        shl reg1, 3
        sub reg1, reg2

by 8:   shl reg1, 3                 ; 1 cycle

by 9:   lea reg1, [reg1+reg1*8]     ; 2 cycles

by 10:  lea reg1, [reg1+reg1*4]     ; 3 cycles
        add reg1, reg1

by 11:  lea reg2, [reg1+reg1*8]     ; 3 cycles
        add reg1, reg1
        add reg1, reg2

by 12:  lea reg1, [reg1+reg1*2]     ; 3 cycles
        shl reg1, 2

by 13:  lea reg2, [reg1+reg1*2]     ; 3 cycles
        shl reg1, 4
        sub reg1, reg2

by 14:  lea reg2, [reg1+reg1]       ; 3 cycles
        shl reg1, 4
        sub reg1, reg2

by 15:  mov reg2, reg1              ; 3 cycles
        shl reg1, 4
        sub reg1, reg2

by 16:  shl reg1, 4                 ; 1 cycle

by 17:  mov reg2, reg1              ; 2 cycles
        shl reg1, 4
        add reg1, reg2

by 18:  lea reg1, [reg1+reg1*8]     ; 3 cycles
        add reg1, reg1

by 19:  lea reg2, [reg1+reg1*2]     ; 3 cycles
        shl reg1, 4
        add reg1, reg2

by 20:  lea reg1, [reg1+reg1*4]     ; 3 cycles
        shl reg1, 2

by 21:  lea reg2, [reg1+reg1*4]     ; 3 cycles
        shl reg1, 4






        add reg1, reg2

by 22:  imul reg1, 22               ; Use the IMUL instruction

by 23:  lea reg2, [reg1+reg1*8]     ; 3 cycles
        shl reg1, 5
        sub reg1, reg2

by 24:  lea reg1, [reg1+reg1*2]     ; 3 cycles
        shl reg1, 3

by 25:  lea reg2, [reg1+reg1*8]     ; 3 cycles
        shl reg1, 4
        add reg1, reg2

by 26:  imul reg1, 26               ; Use the IMUL instruction

by 27:  lea reg2, [reg1+reg1*4]     ; 3 cycles
        shl reg1, 5
        sub reg1, reg2

by 28:  lea reg2, [reg1*4]          ; 3 cycles
        shl reg1, 5
        sub reg1, reg2

by 29:  lea reg2, [reg1+reg1*2]     ; 3 cycles
        shl reg1, 5
        sub reg1, reg2

by 30:  lea reg2, [reg1+reg1]       ; 3 cycles
        shl reg1, 5
        sub reg1, reg2

by 31:  mov reg2, reg1              ; 2 cycles
        shl reg1, 5
        sub reg1, reg2

by 32:  shl reg1, 5                 ; 1 cycle
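
For readers working in C, a few of these replacement sequences can be checked with ordinary shift and add expressions. The sketch below mirrors the sequences for 7, 10, and 23; the function names are hypothetical. An optimizing compiler normally chooses between IMUL and such sequences on its own, so this is meant as a way to verify the arithmetic rather than as a recommended source-level idiom.

```c
#include <stdint.h>

/* x * 7: MOV reg2, reg1; SHL reg1, 3; SUB reg1, reg2. */
uint32_t mul7(uint32_t x)
{
    return (x << 3) - x;
}

/* x * 10: LEA reg1, [reg1+reg1*4]; ADD reg1, reg1. */
uint32_t mul10(uint32_t x)
{
    uint32_t t = x + (x << 2);   /* 5x, as computed by the LEA */
    return t + t;
}

/* x * 23: LEA reg2, [reg1+reg1*8]; SHL reg1, 5; SUB reg1, reg2. */
uint32_t mul23(uint32_t x)
{
    uint32_t t = x + (x << 3);   /* 9x, as computed by the LEA */
    return (x << 5) - t;         /* 32x - 9x */
}
```

Unsigned arithmetic is used so that results wrap modulo 2^32 exactly as the register operations do.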
8.3 Repeated String Instructions

Optimization 

Avoid using the REP prefix when performing string operations, especially when copying blocks of 
memory. 

Rationale

In general, using the REP prefix to repeatedly perform string instructions is less optimal than other 
methods, especially when copying blocks of memory. For a discussion of alternate memory-copy 
methods, see “Memory Copy” on page 120. 

Latency of Repeated String Instructions 

Table 6 shows the latency of repeated string instructions on the AMD Athlon 64 and AMD Opteron 
processors. 

Table 6 lists the latencies with the direction flag (DF) = 0 (increment) and DF = 1 (decrement). In 
addition, these latencies are assumed for aligned memory operands. Note that for MOVS and STOS, 
when DF = 1, the overhead portion of the latency increases significantly. However, these types are 
less commonly found. The user should use the formula and round up to the nearest integer value to 
determine the latency. 


Table 6. Latency of Repeated String Instructions

                                  Number of Cycles
Instruction   When ECX = 0   When ECX = c (1), DF = 0   When ECX = c (1), DF = 1
rep movs      11             15 + (1 * c)               25 + (4/3 * c)
rep stos      11             14 + (1 * c)               24 + (1 * c)
rep lods      11             15 + (2 * c)               15 + (2 * c)
rep scas      11             15 + (5/2 * c)             15 + (5/2 * c)
rep cmps      11             16 + (10/3 * c)            16 + (10/3 * c)

Note:
1. c > 0

Guidelines for Repeated String Instructions 

To help achieve good performance, the following sections contain guidelines for the careful 
scheduling of VectorPath repeated string instructions. 
Use the Largest Possible Operand Size

Always move data using the largest operand size possible. For example, use rep movsd rather than 
rep movsw, and rep movsw rather than rep movsb. Use rep stosd rather than rep stosw, and 
rep stosw rather than rep stosb. 

In 64-bit mode, a quadword data size is available and offers better performance (for example, 
rep movsq and rep stosq). 

Ensure DF = 0 (Increment) 

Always make sure that DF = 0 (increment), the state set by executing CLD, when using rep movs and
rep stos.

DF = 1 (decrement) is only needed for certain cases of overlapping rep movs (for example, source 
and destination overlap). 

While string instructions with DF = 1 (decrement) are slower, only the overhead part of the cycle 
equation is larger and not the throughput part. See Table 6 on page 167 for additional latency 
numbers. 

Align Source and Destination with Operand Size 

For rep movs, make sure that both the source and destination are aligned with regard to the operand 
size. Handle the end case separately, if necessary. If either source or destination cannot be aligned, 
make the destination aligned and the source misaligned. For rep stos, make the destination aligned. 
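
One way to handle the unaligned leading bytes separately is to compute how many bytes precede the next aligned destination address. The helper below is a minimal sketch; its name is hypothetical and it assumes the operand size is a power of two.

```c
#include <stddef.h>
#include <stdint.h>

/* Number of bytes to copy one at a time before dst becomes a multiple of
   op_size (op_size must be a power of two, for example 4 or 8). */
size_t bytes_to_align(uintptr_t dst, size_t op_size)
{
    return (size_t)((0 - dst) & (op_size - 1));
}
```

After copying that many bytes, the destination is aligned and the bulk of the copy can use the full operand size.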

Inline REP String with Low Counts 

For repeat counts of less than 4k, expand REP string instructions into equivalent sequences of simple 
AMD64 instructions. Use an inline sequence of loads and stores to accomplish the move. Use a 
sequence of stores to emulate rep stos. This technique eliminates the setup overhead of REP 
instructions and increases instruction throughput. 

Use Loop for REP String with Low Variable Counts 

If the repeated count is variable, but is likely less than eight, use a simple loop to move/store the data. 
This technique avoids the overhead of rep movs and rep stos. 
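
At the source level, the simple-loop alternative might look like the following C sketch. The function name is hypothetical, and a compiler may transform such a loop on its own, so the generated code should be inspected.

```c
#include <stddef.h>
#include <stdint.h>

/* Copy n doublewords with a plain loop; intended for small, variable n
   where the setup cost of rep movs would dominate. */
void copy_dwords_small(uint32_t *dst, const uint32_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}
```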
8.4 Using XOR to Clear Integer Registers

Optimization 

To clear an integer register to all zeros, use the XOR instruction to exclusive OR the register with 
itself, as shown below. 

Rationale 

AMD Athlon 64 and AMD Opteron processors are able to avoid the false read dependency on the 
XOR instruction. 

Examples 

Acceptable 

mov reg, 0 

Preferred 

xor reg, reg 


Software Optimization Guide for AMD64 Processors 


25112 Rev. 3.06 September 2005 


8.5 Efficient 64-Bit Integer Arithmetic in 32-Bit Mode 

Optimization 

The following section contains a collection of code snippets and subroutines showing the efficient
implementation of 64-bit arithmetic in 32-bit mode. Note that these are 32-bit recommendations;
in 64-bit mode it is important to use 64-bit integer instructions for best performance.

Addition, subtraction, negation, and shifting are best handled by inline code. Multiplication, division, 
and the computation of remainders are less common operations and are usually implemented as 
subroutines. If these subroutines are used often, the programmer should consider inlining them. 
Except for division and remainder calculations, the following code works for both signed and 
unsigned integers. The division and remainder code shown works for unsigned integers, but can easily 
be extended to handle signed integers. 

64-Bit Addition 

; Add ECX:EBX to EDX:EAX, and place sum in EDX:EAX. 
add eax, ebx 
adc edx, ecx 

64-Bit Subtraction 

; Subtract ECX:EBX from EDX:EAX and place difference in EDX:EAX. 
sub eax, ebx 
sbb edx, ecx 

64-Bit Negation 

; Negate EDX:EAX. 
not edx 
neg eax 

sbb edx, -1 ; Fix: Increment high word if low word was 0. 
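
The carry and borrow handling of the three sequences above can be modeled in C on a pair of 32-bit halves. This is only an illustrative model; the `u64_pair` type and the function names are hypothetical.

```c
#include <stdint.h>

typedef struct { uint32_t lo, hi; } u64_pair;   /* hypothetical helper type */

u64_pair add64(u64_pair a, u64_pair b)
{
    u64_pair r;
    r.lo = a.lo + b.lo;                   /* add eax, ebx */
    r.hi = a.hi + b.hi + (r.lo < a.lo);   /* adc edx, ecx: carry out of low word */
    return r;
}

u64_pair sub64(u64_pair a, u64_pair b)
{
    u64_pair r;
    r.lo = a.lo - b.lo;                   /* sub eax, ebx */
    r.hi = a.hi - b.hi - (a.lo < b.lo);   /* sbb edx, ecx: borrow from low word */
    return r;
}

u64_pair neg64(u64_pair a)
{
    u64_pair r;
    r.hi = ~a.hi;                         /* not edx */
    r.lo = 0u - a.lo;                     /* neg eax: CF = (a.lo != 0) */
    r.hi = r.hi + 1u - (a.lo != 0);       /* sbb edx, -1 */
    return r;
}
```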

64-Bit Left Shift

; Shift EDX:EAX left, shift count in ECX (count
; applied modulo 64).
shld edx, eax, cl    ; First apply shift count
shl  eax, cl         ;  mod 32 to EDX:EAX.
test ecx, 32         ; Need to shift by another 32?
jz   lshift_done     ; No, done.
mov  edx, eax        ; Left shift EDX:EAX
xor  eax, eax        ;  by 32 bits.
lshift_done:

64-Bit Right Shift

shrd eax, edx, cl    ; First apply shift count
shr  edx, cl         ;  mod 32 to EDX:EAX.
test ecx, 32         ; Need to shift by another 32?
jz   rshift_done     ; No, done.
mov  eax, edx        ; Right shift EDX:EAX
xor  edx, edx        ;  by 32 bits.
rshift_done:
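
The two-stage approach (shift by the count modulo 32, then conditionally move a whole word) can be modeled in C as follows. This is a sketch with hypothetical function names; the explicit test for a zero count exists only because a 32-bit shift by 32 is undefined in C.

```c
#include <stdint.h>

/* Shift hi:lo left by count (taken modulo 64). */
void shl64(uint32_t *hi, uint32_t *lo, uint32_t count)
{
    uint32_t c = count & 31;              /* shld/shl use the count modulo 32 */
    if (c != 0) {
        *hi = (*hi << c) | (*lo >> (32 - c));
        *lo <<= c;
    }
    if (count & 32) {                     /* test ecx, 32 */
        *hi = *lo;
        *lo = 0;
    }
}

/* Logical shift of hi:lo right by count (taken modulo 64). */
void shr64(uint32_t *hi, uint32_t *lo, uint32_t count)
{
    uint32_t c = count & 31;              /* shrd/shr use the count modulo 32 */
    if (c != 0) {
        *lo = (*lo >> c) | (*hi << (32 - c));
        *hi >>= c;
    }
    if (count & 32) {                     /* test ecx, 32 */
        *lo = *hi;
        *hi = 0;
    }
}
```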

64-Bit Multiplication

; _llmul computes the low-order half of the product of its
; arguments, two 64-bit integers.
;
; In:       [ESP+8]:[ESP+4]   = multiplicand
;           [ESP+16]:[ESP+12] = multiplier
; Out:      EDX:EAX = (multiplicand * multiplier) % 2^64
; Destroys: EAX, ECX, EDX, EFlags

_llmul PROC
   mov  edx, [esp+8]           ; multiplicand_hi
   mov  ecx, [esp+16]          ; multiplier_hi
   or   edx, ecx               ; One operand >= 2^32?
   mov  edx, [esp+12]          ; multiplier_lo
   mov  eax, [esp+4]           ; multiplicand_lo
   jnz  twomul                 ; Yes, need two multiplies.
   mul  edx                    ; multiplicand_lo * multiplier_lo
   ret                         ; Done, return to caller.

twomul:
   imul edx, [esp+8]           ; p3_lo = multiplicand_hi * multiplier_lo
   imul ecx, eax               ; p2_lo = multiplier_hi * multiplicand_lo
   add  ecx, edx               ; p2_lo + p3_lo
   mul  dword ptr [esp+12]     ; p1 = multiplicand_lo * multiplier_lo
   add  edx, ecx               ; p1 + p2_lo + p3_lo = result in EDX:EAX
   ret                         ; Done, return to caller.
_llmul ENDP
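
The decomposition into partial products used by _llmul can be checked with a small C model. The function name is hypothetical; the model returns the low 64 bits of the product, as the assembly routine does.

```c
#include <stdint.h>

uint64_t llmul_sketch(uint64_t a, uint64_t b)
{
    uint32_t a_lo = (uint32_t)a, a_hi = (uint32_t)(a >> 32);
    uint32_t b_lo = (uint32_t)b, b_hi = (uint32_t)(b >> 32);

    uint64_t p1 = (uint64_t)a_lo * b_lo;          /* full 64-bit product (mul) */
    if ((a_hi | b_hi) == 0)
        return p1;                                /* both operands < 2^32 */

    /* Only the low 32 bits of the cross products reach the result (imul). */
    uint32_t cross = a_hi * b_lo + b_hi * a_lo;
    return p1 + ((uint64_t)cross << 32);
}
```

The product of the two high words is never computed because it contributes only to bits 64 and above.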


64-Bit Unsigned Division

; _ulldiv divides two unsigned 64-bit integers and returns the quotient.
;
; In:       [ESP+8]:[ESP+4]   = dividend
;           [ESP+16]:[ESP+12] = divisor
; Out:      EDX:EAX = quotient of division
; Destroys: EAX, ECX, EDX, EFlags

_ulldiv PROC
   push ebx                    ; Save EBX as per calling convention.
   mov  ecx, [esp+20]          ; divisor_hi
   mov  ebx, [esp+16]          ; divisor_lo
   mov  edx, [esp+12]          ; dividend_hi
   mov  eax, [esp+8]           ; dividend_lo
   test ecx, ecx               ; divisor > (2^32 - 1)?
   jnz  big_divisor            ; Yes, divisor > 2^32 - 1.
   cmp  edx, ebx               ; Only one division needed (ECX = 0)?
   jae  two_divs               ; Need two divisions.
   div  ebx                    ; EAX = quotient_lo
   mov  edx, ecx               ; EDX = quotient_hi = 0 (quotient in EDX:EAX)
   pop  ebx                    ; Restore EBX as per calling convention.
   ret                         ; Done, return to caller.

two_divs:
   mov  ecx, eax               ; Save dividend_lo in ECX.
   mov  eax, edx               ; Get dividend_hi.
   xor  edx, edx               ; Zero-extend it into EDX:EAX.
   div  ebx                    ; quotient_hi in EAX
   xchg eax, ecx               ; ECX = quotient_hi, EAX = dividend_lo
   div  ebx                    ; EAX = quotient_lo
   mov  edx, ecx               ; EDX = quotient_hi (quotient in EDX:EAX)
   pop  ebx                    ; Restore EBX as per calling convention.
   ret                         ; Done, return to caller.

big_divisor:
   push edi                    ; Save EDI as per calling convention.
   mov  edi, ecx               ; Save divisor_hi.
   shr  edx, 1                 ; Shift both divisor and dividend right
   rcr  eax, 1                 ;  by 1 bit.
   ror  edi, 1
   rcr  ebx, 1
   bsr  ecx, ecx               ; ECX = number of remaining shifts
   shrd ebx, edi, cl           ; Scale down divisor and dividend
   shrd eax, edx, cl           ;  such that divisor is less than
   shr  edx, cl                ;  2^32 (that is, it fits in EBX).
   rol  edi, 1                 ; Restore original divisor_hi.
   div  ebx                    ; Compute quotient.
   mov  ebx, [esp+12]          ; dividend_lo
   mov  ecx, eax               ; Save quotient.
   imul edi, eax               ; quotient * divisor high word (low only)
   mul  dword ptr [esp+20]     ; quotient * divisor low word
   add  edx, edi               ; EDX:EAX = quotient * divisor
   sub  ebx, eax               ; dividend_lo - (quot.*divisor)_lo
   mov  eax, ecx               ; Get quotient.
   mov  ecx, [esp+16]          ; dividend_hi
   sbb  ecx, edx               ; Subtract (divisor * quot.) from dividend.
   sbb  eax, 0                 ; Adjust quotient if remainder negative.
   xor  edx, edx               ; Clear high word of quot. (EAX<=FFFFFFFFh)
   pop  edi                    ; Restore EDI as per calling convention.
   pop  ebx                    ; Restore EBX as per calling convention.
   ret                         ; Done, return to caller.
_ulldiv ENDP
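
The overall strategy (two narrow divisions when the divisor fits in 32 bits, otherwise a scaled estimate followed by a correction) can be sketched in C. This is a model under stated assumptions, not a transcription of the listing: the function name is hypothetical, the C `/` operator stands in for the DIV instruction, and the correction step decrements the estimate before multiplying back so that the product always fits in 64 bits.

```c
#include <stdint.h>

/* 64-bit unsigned quotient; d must be nonzero. */
uint64_t ulldiv_sketch(uint64_t n, uint64_t d)
{
    uint32_t d_hi = (uint32_t)(d >> 32), d_lo = (uint32_t)d;

    if (d_hi == 0) {
        /* Divisor fits in 32 bits: divide the high word, then the low word. */
        uint32_t q_hi = (uint32_t)(n >> 32) / d_lo;
        uint32_t r    = (uint32_t)(n >> 32) % d_lo;
        uint64_t rest = ((uint64_t)r << 32) | (uint32_t)n;
        return ((uint64_t)q_hi << 32) | (uint32_t)(rest / d_lo);
    }

    /* Scale both operands right until the divisor fits in 32 bits. */
    int s = 1;
    while (s < 32 && (d_hi >> s) != 0)
        s++;
    uint32_t ds = (uint32_t)(d >> s);
    uint64_t q  = (n >> s) / ds;   /* estimate: exact or one too large */

    if (q != 0)
        q--;                       /* now exact or one too small */
    if (n - q * d >= d)
        q++;                       /* remainder still >= divisor: fix up */
    return q;
}
```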
64-Bit Signed Division

; _lldiv divides two signed 64-bit numbers and delivers the quotient.
;
; In:       [ESP+8]:[ESP+4]   = dividend
;           [ESP+16]:[ESP+12] = divisor
; Out:      EDX:EAX = quotient of division
; Destroys: EAX, ECX, EDX, EFlags

_lldiv PROC
   push ebx                    ; Save EBX as per calling convention.
   push esi                    ; Save ESI as per calling convention.
   push edi                    ; Save EDI as per calling convention.
   mov  ecx, [esp+28]          ; divisor_hi
   mov  ebx, [esp+24]          ; divisor_lo
   mov  edx, [esp+20]          ; dividend_hi
   mov  eax, [esp+16]          ; dividend_lo
   mov  esi, ecx               ; divisor_hi
   xor  esi, edx               ; divisor_hi ^ dividend_hi
   sar  esi, 31                ; (quotient < 0) ? -1 : 0
   mov  edi, edx               ; dividend_hi
   sar  edi, 31                ; (dividend < 0) ? -1 : 0
   xor  eax, edi               ; If (dividend < 0),
   xor  edx, edi               ;  compute 1's complement of dividend.
   sub  eax, edi               ; If (dividend < 0),
   sbb  edx, edi               ;  compute 2's complement of dividend.
   mov  edi, ecx               ; divisor_hi
   sar  edi, 31                ; (divisor < 0) ? -1 : 0
   xor  ebx, edi               ; If (divisor < 0),
   xor  ecx, edi               ;  compute 1's complement of divisor.
   sub  ebx, edi               ; If (divisor < 0),
   sbb  ecx, edi               ;  compute 2's complement of divisor.
   jnz  big_divisor            ; divisor > 2^32 - 1
   cmp  edx, ebx               ; Only one division needed (ECX = 0)?
   jae  two_divs               ; Need two divisions.
   div  ebx                    ; EAX = quotient_lo
   mov  edx, ecx               ; EDX = quotient_hi = 0 (quotient in EDX:EAX)
   xor  eax, esi               ; If (quotient < 0),
   xor  edx, esi               ;  compute 1's complement of result.
   sub  eax, esi               ; If (quotient < 0),
   sbb  edx, esi               ;  compute 2's complement of result.
   pop  edi                    ; Restore EDI as per calling convention.
   pop  esi                    ; Restore ESI as per calling convention.
   pop  ebx                    ; Restore EBX as per calling convention.
   ret                         ; Done, return to caller.

two_divs:
   mov  ecx, eax               ; Save dividend_lo in ECX.
   mov  eax, edx               ; Get dividend_hi.
   xor  edx, edx               ; Zero-extend it into EDX:EAX.
   div  ebx                    ; quotient_hi in EAX
   xchg eax, ecx               ; ECX = quotient_hi, EAX = dividend_lo
   div  ebx                    ; EAX = quotient_lo
   mov  edx, ecx               ; EDX = quotient_hi (quotient in EDX:EAX)
   jmp  make_sign              ; Make quotient signed.

big_divisor:
   sub  esp, 12                ; Create three local variables.
   mov  [esp], eax             ; dividend_lo
   mov  [esp+4], ebx           ; divisor_lo
   mov  [esp+8], edx           ; dividend_hi
   mov  edi, ecx               ; Save divisor_hi.
   shr  edx, 1                 ; Shift both
   rcr  eax, 1                 ;  divisor
   ror  edi, 1                 ;  and dividend
   rcr  ebx, 1                 ;  right by 1 bit.
   bsr  ecx, ecx               ; ECX = number of remaining shifts
   shrd ebx, edi, cl           ; Scale down divisor and
   shrd eax, edx, cl           ;  dividend such that divisor is
   shr  edx, cl                ;  less than 2^32 (that is, fits in EBX).
   rol  edi, 1                 ; Restore original divisor_hi.
   div  ebx                    ; Compute quotient.
   mov  ebx, [esp]             ; dividend_lo
   mov  ecx, eax               ; Save quotient.
   imul edi, eax               ; quotient * divisor high word (low only)
   mul  DWORD PTR [esp+4]      ; quotient * divisor low word
   add  edx, edi               ; EDX:EAX = quotient * divisor
   sub  ebx, eax               ; dividend_lo - (quot.*divisor)_lo
   mov  eax, ecx               ; Get quotient.
   mov  ecx, [esp+8]           ; dividend_hi
   sbb  ecx, edx               ; Subtract (divisor * quot.) from dividend.
   sbb  eax, 0                 ; Adjust quotient if remainder is negative.
   xor  edx, edx               ; Clear high word of quotient.
   add  esp, 12                ; Remove local variables.

make_sign:
   xor  eax, esi               ; If (quotient < 0),
   xor  edx, esi               ;  compute 1's complement of result.
   sub  eax, esi               ; If (quotient < 0),
   sbb  edx, esi               ;  compute 2's complement of result.
   pop  edi                    ; Restore EDI as per calling convention.
   pop  esi                    ; Restore ESI as per calling convention.
   pop  ebx                    ; Restore EBX as per calling convention.
   ret                         ; Done, return to caller.
_lldiv ENDP
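
The branchless sign handling in _lldiv (build an all-ones or all-zeros mask with SAR, then XOR and subtract) can be modeled in C. The function name is hypothetical, the unsigned division is delegated to the C `/` operator, and the final conversion assumes the usual two's-complement representation.

```c
#include <stdint.h>

/* Signed 64-bit quotient, truncating toward zero; d must be nonzero. */
int64_t lldiv_sketch(int64_t n, int64_t d)
{
    uint64_t n_mask = (n < 0) ? ~0ULL : 0;   /* sar on dividend_hi */
    uint64_t d_mask = (d < 0) ? ~0ULL : 0;   /* sar on divisor_hi */
    uint64_t q_mask = n_mask ^ d_mask;       /* quotient < 0 if signs differ */

    /* (x ^ mask) - mask negates x when mask is all ones, else leaves it. */
    uint64_t un = ((uint64_t)n ^ n_mask) - n_mask;
    uint64_t ud = ((uint64_t)d ^ d_mask) - d_mask;
    uint64_t uq = un / ud;

    return (int64_t)((uq ^ q_mask) - q_mask);
}
```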


64-Bit Unsigned Remainder Computation

; _ullrem divides two unsigned 64-bit integers and returns the remainder.
;
; In:       [ESP+8]:[ESP+4]   = dividend
;           [ESP+16]:[ESP+12] = divisor
; Out:      EDX:EAX = remainder of division
; Destroys: EAX, ECX, EDX, EFlags

_ullrem PROC
   push ebx                    ; Save EBX as per calling convention.
   mov  ecx, [esp+20]          ; divisor_hi
   mov  ebx, [esp+16]          ; divisor_lo
   mov  edx, [esp+12]          ; dividend_hi
   mov  eax, [esp+8]           ; dividend_lo
   test ecx, ecx               ; divisor > 2^32 - 1?
   jnz  r_big_divisor          ; Yes, divisor > 2^32 - 1.
   cmp  edx, ebx               ; Only one division needed (ECX = 0)?
   jae  r_two_divs             ; Need two divisions.
   div  ebx                    ; EAX = quotient_lo
   mov  eax, edx               ; EAX = remainder_lo
   mov  edx, ecx               ; EDX = remainder_hi = 0
   pop  ebx                    ; Restore EBX per calling convention.
   ret                         ; Done, return to caller.

r_two_divs:
   mov  ecx, eax               ; Save dividend_lo in ECX.
   mov  eax, edx               ; Get dividend_hi.
   xor  edx, edx               ; Zero-extend it into EDX:EAX.
   div  ebx                    ; EAX = quotient_hi, EDX = intermediate remainder
   mov  eax, ecx               ; EAX = dividend_lo
   div  ebx                    ; EAX = quotient_lo
   mov  eax, edx               ; EAX = remainder_lo
   xor  edx, edx               ; EDX = remainder_hi = 0
   pop  ebx                    ; Restore EBX as per calling convention.
   ret                         ; Done, return to caller.

r_big_divisor:
   push edi                    ; Save EDI as per calling convention.
   mov  edi, ecx               ; Save divisor_hi.
   shr  edx, 1                 ; Shift both divisor and dividend right
   rcr  eax, 1                 ;  by 1 bit.
   ror  edi, 1
   rcr  ebx, 1
   bsr  ecx, ecx               ; ECX = number of remaining shifts
   shrd ebx, edi, cl           ; Scale down divisor and dividend such
   shrd eax, edx, cl           ;  that divisor is less than 2^32
   shr  edx, cl                ;  (that is, it fits in EBX).
   rol  edi, 1                 ; Restore original divisor_hi.
   div  ebx                    ; Compute quotient.
   mov  ebx, [esp+12]          ; dividend low word
   mov  ecx, eax               ; Save quotient.
   imul edi, eax               ; quotient * divisor high word (low only)
   mul  DWORD PTR [esp+20]     ; quotient * divisor low word
   add  edx, edi               ; EDX:EAX = quotient * divisor
   sub  ebx, eax               ; dividend_lo - (quot.*divisor)_lo
   mov  ecx, [esp+16]          ; dividend_hi
   mov  eax, [esp+20]          ; divisor_lo
   sbb  ecx, edx               ; Subtract divisor * quot. from dividend.
   sbb  edx, edx               ; (remainder < 0) ? 0xFFFFFFFF : 0
   and  eax, edx               ; (remainder < 0) ? divisor_lo : 0
   and  edx, [esp+24]          ; (remainder < 0) ? divisor_hi : 0
   add  eax, ebx               ; remainder += (remainder < 0) ?
   adc  edx, ecx               ;  divisor : 0
   pop  edi                    ; Restore EDI as per calling convention.
   pop  ebx                    ; Restore EBX as per calling convention.
   ret                         ; Done, return to caller.
_ullrem ENDP
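
The branchless fix-up at the end of the large-divisor path (add the divisor back only if the trial remainder went negative) can be modeled in C. The function name is hypothetical, and the sketch assumes the quotient estimate is either exact or one too large and that its product with the divisor fits in 64 bits.

```c
#include <stdint.h>

/* Remainder of n / d, given an estimate q_est of the quotient that is
   assumed to be exact or one too large. */
uint64_t rem_from_estimate(uint64_t n, uint64_t d, uint64_t q_est)
{
    uint64_t prod = q_est * d;
    uint64_t rem  = n - prod;                  /* wraps if prod > n */
    uint64_t mask = (n < prod) ? ~0ULL : 0;    /* sbb edx, edx */
    return rem + (d & mask);                   /* and ..., then add/adc */
}
```

With n = 100 and d = 7, both the exact estimate 14 and the overshooting estimate 15 yield a remainder of 2.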





64-Bit Signed Remainder Computation

; _llrem divides two signed 64-bit numbers and returns the remainder.
;
; In:       [ESP+8]:[ESP+4]   = dividend
;           [ESP+16]:[ESP+12] = divisor
; Out:      EDX:EAX = remainder of division
; Destroys: EAX, ECX, EDX, EFlags

_llrem PROC
   push ebx                    ; Save EBX as per calling convention.
   push esi                    ; Save ESI as per calling convention.
   push edi                    ; Save EDI as per calling convention.
   mov  ecx, [esp+28]          ; divisor_hi
   mov  ebx, [esp+24]          ; divisor_lo
   mov  edx, [esp+20]          ; dividend_hi
   mov  eax, [esp+16]          ; dividend_lo
   mov  esi, edx               ; sign(remainder) == sign(dividend)
   sar  esi, 31                ; (remainder < 0) ? -1 : 0
   mov  edi, edx               ; dividend_hi
   sar  edi, 31                ; (dividend < 0) ? -1 : 0
   xor  eax, edi               ; If (dividend < 0),
   xor  edx, edi               ;  compute 1's complement of dividend.
   sub  eax, edi               ; If (dividend < 0),
   sbb  edx, edi               ;  compute 2's complement of dividend.
   mov  edi, ecx               ; divisor_hi
   sar  edi, 31                ; (divisor < 0) ? -1 : 0
   xor  ebx, edi               ; If (divisor < 0),
   xor  ecx, edi               ;  compute 1's complement of divisor.
   sub  ebx, edi               ; If (divisor < 0),
   sbb  ecx, edi               ;  compute 2's complement of divisor.
   jnz  sr_big_divisor         ; divisor > 2^32 - 1
   cmp  edx, ebx               ; Only one division needed (ECX = 0)?
   jae  sr_two_divs            ; No, need two divisions.
   div  ebx                    ; EAX = quotient_lo
   mov  eax, edx               ; EAX = remainder_lo
   mov  edx, ecx               ; EDX = remainder_hi = 0
   xor  eax, esi               ; If (remainder < 0),
   xor  edx, esi               ;  compute 1's complement of result.
   sub  eax, esi               ; If (remainder < 0),
   sbb  edx, esi               ;  compute 2's complement of result.
   pop  edi                    ; Restore EDI as per calling convention.
   pop  esi                    ; Restore ESI as per calling convention.
   pop  ebx                    ; Restore EBX as per calling convention.
   ret                         ; Done, return to caller.

sr_two_divs:
   mov  ecx, eax               ; Save dividend_lo in ECX.
   mov  eax, edx               ; Get dividend_hi.
   xor  edx, edx               ; Zero-extend it into EDX:EAX.
   div  ebx                    ; EAX = quotient_hi, EDX = intermediate remainder
   mov  eax, ecx               ; EAX = dividend_lo
   div  ebx                    ; EAX = quotient_lo
   mov  eax, edx               ; remainder_lo
   xor  edx, edx               ; remainder_hi = 0
   jmp  sr_makesign            ; Make remainder signed.

sr_big_divisor:
   sub  esp, 16                ; Create four local variables.
   mov  [esp], eax             ; dividend_lo
   mov  [esp+4], ebx           ; divisor_lo
   mov  [esp+8], edx           ; dividend_hi
   mov  [esp+12], ecx          ; divisor_hi
   mov  edi, ecx               ; Save divisor_hi.
   shr  edx, 1                 ; Shift both
   rcr  eax, 1                 ;  divisor
   ror  edi, 1                 ;  and dividend
   rcr  ebx, 1                 ;  right by 1 bit.
   bsr  ecx, ecx               ; ECX = number of remaining shifts
   shrd ebx, edi, cl           ; Scale down divisor and
   shrd eax, edx, cl           ;  dividend such that divisor is
   shr  edx, cl                ;  less than 2^32 (that is, fits in EBX).
   rol  edi, 1                 ; Restore original divisor_hi.
   div  ebx                    ; Compute quotient.
   mov  ebx, [esp]             ; dividend_lo
   mov  ecx, eax               ; Save quotient.
   imul edi, eax               ; quotient * divisor high word (low only)
   mul  DWORD PTR [esp+4]      ; quotient * divisor low word
   add  edx, edi               ; EDX:EAX = quotient * divisor
   sub  ebx, eax               ; dividend_lo - (quot.*divisor)_lo
   mov  ecx, [esp+8]           ; dividend_hi
   sbb  ecx, edx               ; Subtract divisor * quot. from dividend.
   sbb  eax, eax               ; remainder < 0 ? 0xffffffff : 0
   mov  edx, [esp+12]          ; divisor_hi
   and  edx, eax               ; remainder < 0 ? divisor_hi : 0
   and  eax, [esp+4]           ; remainder < 0 ? divisor_lo : 0
   add  eax, ebx               ; remainder_lo
   adc  edx, ecx               ; remainder_hi (with carry from remainder_lo)
   add  esp, 16                ; Remove local variables.

sr_makesign:
   xor  eax, esi               ; If (remainder < 0),
   xor  edx, esi               ;  compute 1's complement of result.
   sub  eax, esi               ; If (remainder < 0),
   sbb  edx, esi               ;  compute 2's complement of result.
   pop  edi                    ; Restore EDI as per calling convention.
   pop  esi                    ; Restore ESI as per calling convention.
   pop  ebx                    ; Restore EBX as per calling convention.
   ret                         ; Done, return to caller.
_llrem ENDP
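
The rule that the remainder takes the sign of the dividend can be modeled in C with the same mask technique. The function name is hypothetical, the unsigned remainder is delegated to the C `%` operator, and the final conversion assumes two's-complement representation.

```c
#include <stdint.h>

/* Signed 64-bit remainder; the result has the sign of the dividend.
   d must be nonzero. */
int64_t llrem_sketch(int64_t n, int64_t d)
{
    uint64_t n_mask = (n < 0) ? ~0ULL : 0;   /* also the sign of the result */
    uint64_t d_mask = (d < 0) ? ~0ULL : 0;

    /* (x ^ mask) - mask negates x when mask is all ones, else leaves it. */
    uint64_t un = ((uint64_t)n ^ n_mask) - n_mask;
    uint64_t ud = ((uint64_t)d ^ d_mask) - d_mask;
    uint64_t ur = un % ud;

    return (int64_t)((ur ^ n_mask) - n_mask);
}
```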
8.6 Efficient Implementation of Population-Count Function in 32-Bit Mode

Population count is an operation that determines the number of set bits in a bit string. For example, 
this can be used to determine the cardinality of a set. The example code in this section shows how to 
efficiently implement a population count operation for 32-bit operands. The example is written for the 
inline assembler of Microsoft® Visual C. 

Function popcount implements a branchless computation of the population count. It is based on a 
O(log(n)) algorithm that successively groups the bits into groups of 2, 4, 8, 16, and 32, while
maintaining a count of the set bits in each group. The algorithm consists of the following steps: 

1. Partition the integer into groups of two bits. Compute the population count for each 2-bit group 
and store the result in the 2-bit group. This calls for the following transformation to be performed 
for each 2-bit group: 

00b -> 00b 
01b -> 01b 
10b -> 01b 
11b -> 10b

If the original value of a 2-bit group is v, then the new value will be v - (v >> 1). In order to handle
all 2-bit groups simultaneously, it is necessary to mask appropriately to prevent spilling from one 
bit group to the next lower bit group. Thus: 

w = v - ((v >> 1) & 0x55555555) 

2. Add the population counts of adjacent 2-bit groups and store the sum in the 4-bit group resulting
from merging these adjacent 2-bit groups. To do this simultaneously for all groups, mask out the
odd numbered groups, mask out the even numbered groups, and then add the odd numbered
groups to the even numbered groups:

x = (w & 0x33333333) + ((w >> 2) & 0x33333333) 

Each 4-bit field now has one of the following values: 0000b, 0001b, 0010b, 0011b, or 0100b. 

3. For the first time, the value in each k-bit field is small enough that adding two k-bit fields results
in a value that still fits in the k-bit field. Thus the following computation is performed:

y = (x + (x >> 4)) & 0x0F0F0F0F

The result is four 8-bit fields whose lower half has the desired sum and whose upper half contains
“junk” that has to be masked out. A symbolic form is as follows:

x      = 0aaa0bbb0ccc0ddd0eee0fff0ggg0hhh
x >> 4 = 00000aaa0bbb0ccc0ddd0eee0fff0ggg
sum    = 0aaaWWWWiiiiXXXXjjjjYYYYkkkkZZZZ

The WWWW, XXXX, YYYY, and ZZZZ values are the interesting sums with each at most
1000b, or 8 decimal.
4. The four 4-bit sums can now be rapidly accumulated by multiplying with a so-called magic
multiplier. This can be derived from looking at the following chart of partial products:

0p0q0r0s * 01010101 =

        :0p0q0r0s
      0p:0q0r0s
    0p0q:0r0s
  0p0q0r:0s
000pxxww:vvuutt0s

Here p, q, r, and s are the 4-bit sums from the previous step, and vv is the final interesting result.
The final result is as follows: 

z = (y * 0x01010101) >> 24 
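
The four steps above translate directly into portable C. This is a sketch that reuses the intermediate names w, x, and y from the text; the function name popcount32 is hypothetical.

```c
#include <stdint.h>

uint32_t popcount32(uint32_t v)
{
    uint32_t w = v - ((v >> 1) & 0x55555555u);                  /* 2-bit counts */
    uint32_t x = (w & 0x33333333u) + ((w >> 2) & 0x33333333u);  /* 4-bit counts */
    uint32_t y = (x + (x >> 4)) & 0x0F0F0F0Fu;                  /* 8-bit counts */
    return (y * 0x01010101u) >> 24;                             /* sum of bytes */
}
```

The multiplication wraps modulo 2^32, so only the top byte of the low 32 bits of the product, which holds the sum of the four byte counts, survives the final shift.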

Integer Version

unsigned int popcount(unsigned int v)
{
   unsigned int retVal;

   __asm {
      mov  eax, [v]            ; v
      mov  edx, eax            ; v
      shr  eax, 1              ; v >> 1
      and  eax, 055555555h     ; (v >> 1) & 0x55555555
      sub  edx, eax            ; w = v - ((v >> 1) & 0x55555555)
      mov  eax, edx            ; w
      shr  edx, 2              ; w >> 2
      and  eax, 033333333h     ; w & 0x33333333
      and  edx, 033333333h     ; (w >> 2) & 0x33333333
      add  eax, edx            ; x = (w & 0x33333333) + ((w >> 2) &
                               ;     0x33333333)
      mov  edx, eax            ; x
      shr  eax, 4              ; x >> 4
      add  eax, edx            ; x + (x >> 4)
      and  eax, 00F0F0F0Fh     ; y = (x + (x >> 4)) & 0x0F0F0F0F
      imul eax, 001010101h     ; y * 0x01010101
      shr  eax, 24             ; population count = (y *
                               ;     0x01010101) >> 24
      mov  retVal, eax         ; Store result.
   }
   return(retVal);
}
8.7 Efficient Binary-to-ASCII Decimal Conversion

Fast binary-to-ASCII decimal conversion can be important to the performance of software working
with text-oriented protocols like HTML, such as web servers. The following examples show two
optimized functions for fast conversion of unsigned integers to ASCII decimal strings on
AMD Athlon 64 and AMD Opteron processors. The code is written for the Microsoft Visual C
compiler.

The function uint_to_ascii_lz converts like sprintf(sptr, "%010u", x); that is, leading zeros
are retained. The function uint_to_ascii_nlz converts like sprintf(sptr, "%u", x); that is, leading
zeros are suppressed.
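
For reference, the two output formats can be reproduced with the standard library. The wrapper names below are hypothetical and exist only to show the expected strings; they are not the optimized routines.

```c
#include <stdio.h>

/* Reference behavior for uint_to_ascii_lz: ten digits, zero padded. */
void format_lz(char *sptr, unsigned int x)
{
    sprintf(sptr, "%010u", x);
}

/* Reference behavior for uint_to_ascii_nlz: no leading zeros. */
void format_nlz(char *sptr, unsigned int x)
{
    sprintf(sptr, "%u", x);
}
```

For x = 42, format_lz writes "0000000042" and format_nlz writes "42". Both need a buffer of at least 11 bytes for 32-bit arguments.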

This code can easily be extended to convert signed integers by isolating the sign information and 
computing the absolute value as shown in Listing on page 130 before starting the conversion process. 
For restricted argument ranges, construct more efficient conversion routines using the same algorithm 
as used for the general case presented here. 

The algorithm first splits the input argument into suitably sized blocks by dividing the input by an 
appropriate power of ten and working separately on the quotient and remainder of that division. The 
DIV instruction is avoided as described in “Replacing Division with Multiplication” on page 160. 
Each block is then converted into a fixed-point format that consists of one (decimal) integer digit and 
a binary fraction. This allows the generation of additional decimal digits by repeated multiplication of 
the fraction by 10. For efficiency reasons the algorithm implements this multiplication by multiplying 
by five and moving the binary point to the right by one bit for each step of the algorithm. To avoid 
loop overhead and branch mispredictions, the digit generation loop is completely unrolled. In order to 
maximize parallelism, the code in uint_to_ascii_lz splits the input into two equally sized blocks
each of which yields five decimal digits for the result. 
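
The algorithm described above can be modeled in C as follows. This is a sketch, not the optimized routine: the function names are hypothetical, a short loop replaces the fully unrolled digit generation, and 64-bit C arithmetic stands in for the MUL instruction. The two reciprocal constants are the ones used in the assembly listing.

```c
#include <stdint.h>

/* Emit the five decimal digits of y (0 <= y < 100000). */
static void five_digits(uint32_t y, char *out)
{
    /* t is about y / 1e4 in 17.15 fixed point (1.0 == 1 << 15);
       the +1 compensates for truncation in the reciprocal multiply. */
    uint32_t t = (uint32_t)(((uint64_t)y * 0xD1B71759u) >> 30) + 1;
    int frac_bits = 15;

    for (int i = 0; i < 5; i++) {
        out[i] = (char)('0' + (t >> frac_bits));   /* integer part = digit */
        t &= (1u << frac_bits) - 1;                /* keep the fraction */
        t *= 5;                                    /* times 10 = times 5 ...      */
        frac_bits--;                               /* ... and move binary point */
    }
}

/* Write x as exactly ten decimal digits plus a terminating zero. */
void uint_to_ascii_lz_sketch(char *sptr, uint32_t x)
{
    /* x / 100000 by multiplication with a scaled reciprocal. */
    uint32_t hi = (uint32_t)((((uint64_t)x + 1) * 0xA7C5AC47u) >> 48);
    uint32_t lo = x - hi * 100000u;                /* x % 100000 */

    five_digits(hi, sptr);
    five_digits(lo, sptr + 5);
    sptr[10] = '\0';
}
```

Each block starts with a single decimal digit in the integer part of the fixed-point value, so every subsequent multiplication by 10 exposes exactly one new digit.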

Binary-to-ASCII Decimal Conversion Retaining Leading Zeros 

__declspec(naked) void __stdcall uint_to_ascii_lz(char *sptr, unsigned int x)
{
   __asm {
      push edi                 ; Save as per calling conventions.
      push esi                 ; Save as per calling conventions.
      push ebx                 ; Save as per calling conventions.
      mov  eax, [esp+20]       ; x
      mov  edi, [esp+16]       ; sptr
      mov  esi, eax            ; x
      mov  edx, 0xA7C5AC47     ; Divide x by
      mul  edx                 ;  100,000 using
      add  eax, 0xA7C5AC47     ;  multiplication
      adc  edx, 0              ;  with reciprocal.
      shr  edx, 16             ; y1 = x / 1e5
      mov  ecx, edx            ; y1
      imul edx, 100000         ; (x / 1e5) * 1e5
      sub  esi, edx            ; y2 = x % 1e5
      mov  eax, 0xD1B71759     ; 2^15 / 1e4 * 2^30
      mul  ecx                 ; Divide y1 by 1e4,
      shr  eax, 30             ;  converting it into
      lea  ebx, [eax+edx*4+1]  ;  17.15 fixed-point format
      mov  ecx, ebx            ;  such that 1.0 = 2^15.
      mov  eax, 0xD1B71759     ; 2^15 / 1e4 * 2^30
      mul  esi                 ; Divide y2 by 1e4,
      shr  eax, 30             ;  converting it into
      lea  esi, [eax+edx*4+1]  ;  17.15 fixed-point format
      mov  edx, esi            ;  such that 1.0 = 2^15.
      shr  ecx, 15             ; 1st digit
      and  ebx, 0x00007fff     ; Fraction part
      or   ecx, '0'            ; Convert 1st digit to ASCII.
      mov  [edi+0], cl         ; Store 1st digit in memory.
      lea  ecx, [ebx+ebx*4]    ; 5 * fraction, new digit ECX[31-14]
      lea  ebx, [ebx+ebx*4]    ; 5 * fraction, new fraction EBX[13-0]
      shr  edx, 15             ; 6th digit
      and  esi, 0x00007fff     ; Fraction part
      or   edx, '0'            ; Convert 6th digit to ASCII.
      mov  [edi+5], dl         ; Store 6th digit in memory.
      lea  edx, [esi+esi*4]    ; 5 * fraction, new digit EDX[31-14]
      lea  esi, [esi+esi*4]    ; 5 * fraction, new fraction ESI[13-0]
      shr  ecx, 14             ; 2nd digit
      and  ebx, 0x00003fff     ; Fraction part
      or   ecx, '0'            ; Convert 2nd digit to ASCII.
      mov  [edi+1], cl         ; Store 2nd digit in memory.
      lea  ecx, [ebx+ebx*4]    ; 5 * fraction, new digit ECX[31-13]
      lea  ebx, [ebx+ebx*4]    ; 5 * fraction, new fraction EBX[12-0]
      shr  edx, 14             ; 7th digit
      and  esi, 0x00003fff     ; Fraction part
      or   edx, '0'            ; Convert 7th digit to ASCII.
      mov  [edi+6], dl         ; Store 7th digit in memory.
      lea  edx, [esi+esi*4]    ; 5 * fraction, new digit EDX[31-13]
      lea  esi, [esi+esi*4]    ; 5 * fraction, new fraction ESI[12-0]
      shr  ecx, 13             ; 3rd digit
      and  ebx, 0x00001fff     ; Fraction part
      or   ecx, '0'            ; Convert 3rd digit to ASCII.
      mov  [edi+2], cl         ; Store 3rd digit in memory.
      lea  ecx, [ebx+ebx*4]    ; 5 * fraction, new digit ECX[31-12]
      lea  ebx, [ebx+ebx*4]    ; 5 * fraction, new fraction EBX[11-0]
      shr  edx, 13             ; 8th digit
      and  esi, 0x00001fff     ; Fraction part
      or   edx, '0'            ; Convert 8th digit to ASCII.
      mov  [edi+7], dl         ; Store 8th digit in memory.
      lea  edx, [esi+esi*4]    ; 5 * fraction, new digit EDX[31-12]
      lea  esi, [esi+esi*4]    ; 5 * fraction, new fraction ESI[11-0]
      shr  ecx, 12             ; 4th digit
      and  ebx, 0x00000fff     ; Fraction part
      or   ecx, '0'            ; Convert 4th digit to ASCII.
      mov  [edi+3], cl         ; Store 4th digit in memory.
      lea  ecx, [ebx+ebx*4]    ; 5 * fraction, new digit ECX[31-11]
      shr  edx, 12             ; 9th digit
      and  esi, 0x00000fff     ; Fraction part
      or   edx, '0'            ; Convert 9th digit to ASCII.
      mov  [edi+8], dl         ; Store 9th digit in memory.
      lea  edx, [esi+esi*4]    ; 5 * fraction, new digit EDX[31-11]
      shr  ecx, 11             ; 5th digit
      or   ecx, '0'            ; Convert 5th digit to ASCII.
      mov  [edi+4], cl         ; Store 5th digit in memory.
      shr  edx, 11             ; 10th digit
      or   edx, '0'            ; Convert 10th digit to ASCII.
      mov  [edi+9], dx         ; Store 10th digit and end marker in memory.
      pop  ebx                 ; Restore register as per calling convention.
      pop  esi                 ; Restore register as per calling convention.
      pop  edi                 ; Restore register as per calling convention.
      ret  8                   ; Pop two DWORD arguments and return.
   }
}

Binary-to-ASCII Decimal Conversion Suppressing Leading Zeros 

__declspec(naked) void __stdcall uint_to_ascii_nlz(char *sptr, unsigned int x)
{
   __asm {
      push edi                 ; Save as per calling conventions.
      push ebx                 ; Save as per calling conventions.
      mov  edi, [esp+12]       ; sptr
      mov  eax, [esp+16]       ; x
      mov  ecx, eax            ; Save original argument.
      mov  edx, 89705F41h      ; 1e-9 * 2^61 rounded
      mul  edx                 ; Divide by 1e9 by multiplying with reciprocal.
      add  eax, eax            ; Round division result.
      adc  edx, 0              ; EDX[31-29] = argument / 1e9
      shr  edx, 29             ; Leading decimal digit, 0...4
      mov  eax, edx            ; Leading digit
      mov  ebx, edx            ; Initialize digit accumulator with
                               ;  leading digit.
      imul eax, 1000000000     ; Leading digit * 1e9
      sub  ecx, eax            ; Subtract (leading digit * 1e9) from argument.
      or   dl, '0'             ; Convert leading digit to ASCII.
      mov  [edi], dl           ; Store leading digit.
      cmp  ebx, 1              ; Any nonzero digit yet?
      sbb  edi, -1             ; Yes, increment ptr. No, keep old ptr.
      mov  eax, ecx            ; Get reduced argument < 1e9.
      mov  edx, 0abcc7712h     ; 2^28 / 1e8 * 2^30 rounded up
      mul  edx                 ; Divide reduced
      shr  eax, 30             ;  argument < 1e9 by 1e8,
      lea  edx, [eax+4*edx+1]  ;  converting it into 4.28 fixed-point
      mov  eax, edx            ;  format such that 1.0 = 2^28.
      shr  eax, 28             ; Next digit
      and  edx, 0fffffffh      ; Fraction part
      or   ebx, eax            ; Accumulate next digit.
      or   eax, '0'            ; Convert digit to ASCII.
      mov  [edi], al           ; Store digit in memory.
      lea  eax, [edx*4+edx]    ; 5 * fraction, new digit EAX[31-27]
      lea  edx, [edx*4+edx]    ; 5 * fraction, new fraction EDX[26-0]
      cmp  ebx, 1              ; Any nonzero digit yet?
      sbb  edi, -1             ; Yes, increment ptr. No, keep old ptr.
      shr  eax, 27             ; Next digit
      and  edx, 07ffffffh      ; Fraction part
      or   ebx, eax            ; Accumulate next digit.
      or   eax, '0'            ; Convert digit to ASCII.
      mov  [edi], al           ; Store digit in memory.
      lea  eax, [edx*4+edx]    ; 5 * fraction, new digit EAX[31-26]
      lea  edx, [edx*4+edx]    ; 5 * fraction, new fraction EDX[25-0]
      cmp  ebx, 1              ; Any nonzero digit yet?
      sbb  edi, -1             ; Yes, increment ptr. No, keep old ptr.
      shr  eax, 26             ; Next digit
      and  edx, 03ffffffh      ; Fraction part
      or   ebx, eax            ; Accumulate next digit.
      or   eax, '0'            ; Convert digit to ASCII.
      mov  [edi], al           ; Store digit in memory.
      lea  eax, [edx*4+edx]    ; 5 * fraction, new digit EAX[31-25]
      lea  edx, [edx*4+edx]    ; 5 * fraction, new fraction EDX[24-0]
      cmp  ebx, 1              ; Any nonzero digit yet?
      sbb  edi, -1             ; Yes, increment ptr. No, keep old ptr.
      shr  eax, 25             ; Next digit
      and  edx, 01ffffffh      ; Fraction part
      or   ebx, eax            ; Accumulate next digit.
      or   eax, '0'            ; Convert digit to ASCII.
      mov  [edi], al           ; Store digit in memory.
      lea  eax, [edx*4+edx]    ; 5 * fraction, new digit EAX[31-24]
      lea  edx, [edx*4+edx]    ; 5 * fraction, new fraction EDX[23-0]
      cmp  ebx, 1              ; Any nonzero digit yet?
      sbb  edi, -1             ; Yes, increment ptr. No, keep old ptr.
      shr  eax, 24             ; Next digit
      and  edx, 00ffffffh      ; Fraction part
      or   ebx, eax            ; Accumulate next digit.
      or   eax, '0'            ; Convert digit to ASCII.
      mov  [edi], al           ; Store digit in memory.
      lea  eax, [edx*4+edx]    ; 5 * fraction, new digit EAX[31-23]
      lea  edx, [edx*4+edx]    ; 5 * fraction, new fraction EDX[22-0]
      cmp  ebx, 1              ; Any nonzero digit yet?
      sbb  edi, -1             ; Yes, increment ptr. No, keep old ptr.
      shr  eax, 23             ; Next digit
      and  edx, 007fffffh      ; Fraction part
      or   ebx, eax            ; Accumulate next digit.
      or   eax, '0'            ; Convert digit to ASCII.
      mov  [edi], al           ; Store digit in memory.
      lea  eax, [edx*4+edx]    ; 5 * fraction, new digit EAX[31-22]
      lea  edx, [edx*4+edx]    ; 5 * fraction, new fraction EDX[21-0]
      cmp  ebx, 1              ; Any nonzero digit yet?
      sbb  edi, -1             ; Yes, increment ptr. No, keep old ptr.
      shr  eax, 22             ; Next digit
      and  edx, 003fffffh      ; Fraction part
      or   ebx, eax            ; Accumulate next digit.
      or   eax, '0'            ; Convert digit to ASCII.
      mov  [edi], al           ; Store digit in memory.
      lea  eax, [edx*4+edx]    ; 5 * fraction, new digit EAX[31-21]
      lea  edx, [edx*4+edx]    ; 5 * fraction, new fraction EDX[20-0]
      cmp  ebx, 1              ; Any nonzero digit yet?
      sbb  edi, -1             ; Yes, increment ptr. No, keep old ptr.

shr 

eax, 

21 

Next digit 


and 

edx, 

OOlfffffh 

Fraction part 


or 

ebx, 

eax 

Accumulate next digit. 


or 

eax, 

' 0 ' 

Convert digit to ASCII. 


mov 

[edi] 

, al 

Store digit in memory. 


lea 

eax, 

[edx*4+edx] 

5 * fraction, new digit 

EAX[31-20] 

cmp 

ebx, 

1 

Any nonzero digit yet? 


sbb 

edi, 

-1 

Yes, increment ptr. No, 

keep old ptr. 

shr 

eax, 

20 

Next digit 


or 

eax, 

' 0 ' 

Convert digit to ASCII. 


mov 

[edi] 

, ax 

Store last digit and end marker in memory. 

pop 

ebx 


Restore register as per 

calling convention 

pop 

edi 


Restore register as per 

calling convention 

ret 

8 


Pop two DWORD arguments 

and return. 


} 

} 



Software Optimization Guide for AMD64 Processors 


25112 Rev. 3.06 September 2005 


8.8 Derivation of Algorithm, Multiplier, and Shift Factor for Integer Division by Constants

The following examples illustrate the derivation of the algorithm, multiplier, and shift factor for
signed and unsigned integer division.

Unsigned Integer Division

The utility udiv.exe was compiled from the code shown in this section. The utilities provided in this
document are for reference only and are not supported by AMD.

The following code derives the multiplier value used when performing integer division by constants.
The code works for unsigned integer division and for odd divisors between 1 and 2^31 - 1, inclusive.
For divisors of the form d = d' * 2^n, the multiplier is the same as for d' and the shift factor is s + n.


Example

/* This program determines the algorithm (a), multiplier (m), and
   shift factor (s) to be used to accomplish *unsigned* division by
   a constant divisor. Compile with MSVC.
*/

#include <stdio.h>
#include <stdlib.h>   /* exit() */

typedef unsigned __int64 U64;
typedef unsigned long    U32;

/* Returns floor(log2(i)) for i > 0. */
U32 log2(U32 i)
{
   U32 t = 0;
   i = i >> 1;
   while (i) {
      i = i >> 1;
      t++;
   }
   return(t);
}

U32 res1, res2;
U32 d, l, s, m, a, r, n, t;
U64 m_low, m_high, j, k;

int main(void)
{
   fprintf(stderr, "\n");
   fprintf(stderr, "Unsigned division by constant\n");
   fprintf(stderr, "=============================\n\n");

   fprintf(stderr, "enter divisor: ");
   scanf("%lu", &d);
   printf("\n");

   if (d == 0) goto printed_code;

   /* Divisors of 2^31 or more: the quotient can only be 0 or 1. */
   if (d >= 0x80000000UL) {
      printf("; dividend: register or memory location\n");
      printf("\n");
      printf("CMP dividend, 0%08lXh\n", d);
      printf("MOV EDX, 0\n");
      printf("SBB EDX, -1\n");
      printf("\n");
      printf("; quotient now in EDX\n");
      goto printed_code;
   }

   /* Reduce divisor until it becomes odd. */
   n = 0;
   t = d;
   while (!(t & 1)) {
      t >>= 1;
      n++;
   }

   /* Divisor is a power of two: copy or shift the dividend. */
   if (t == 1) {
      if (n == 0) {
         printf("; dividend: register or memory location\n");
         printf("\n");
         printf("MOV EDX, dividend\n");
         printf("\n");
         printf("; quotient now in EDX\n");
      }
      else {
         printf("; dividend: register or memory location\n");
         printf("\n");
         printf("SHR dividend, %lu\n", n);
         printf("\n");
         printf("; quotient replaced dividend\n");
      }
      goto printed_code;
   }

   /* Generate m, s for algorithm 0. Based on: Granlund, T.; Montgomery,
      P.L.: "Division by Invariant Integers using Multiplication."
      SIGPLAN Notices, Vol. 29, June 1994, page 61.
   */
   l = log2(t) + 1;
   j = (((U64)(0xffffffff)) % ((U64)(t)));
   k = (((U64)(1)) << (32 + l)) / ((U64)(0xffffffff - j));
   m_low  = (((U64)(1)) << (32 + l)) / t;
   m_high = ((((U64)(1)) << (32 + l)) + k) / t;
   while (((m_low >> 1) < (m_high >> 1)) && (l > 0)) {
      m_low  = m_low >> 1;
      m_high = m_high >> 1;
      l = l - 1;
   }
   if ((m_high >> 32) == 0) {
      m = ((U32)(m_high));
      s = l;
      a = 0;
   }

   /* Generate m and s for algorithm 1. Based on: Magenheimer, D.J.; et al:
      "Integer Multiplication and Division on the HP Precision Architecture."
      IEEE Transactions on Computers, Vol. 37, No. 8, August 1988, page 980.
   */
   else {
      s = log2(t);
      m_low = (((U64)(1)) << (32 + s)) / ((U64)(t));
      r = ((U32)((((U64)(1)) << (32 + s)) % ((U64)(t))));
      m = (r < ((t >> 1) + 1)) ? ((U32)(m_low)) : ((U32)(m_low)) + 1;
      a = 1;
   }

   /* Reduce multiplier for either algorithm to smallest possible. */
   while (!(m & 1)) {
      m = m >> 1;
      s--;
   }

   /* Adjust shift factor for reduction of even divisors. */
   s += n;

   if (a) {
      printf("; dividend: register other than EAX or memory location\n");
      printf("\n");
      printf("MOV EAX, 0%08lXh\n", m);
      printf("MUL dividend\n");
      printf("ADD EAX, 0%08lXh\n", m);
      printf("ADC EDX, 0\n");
      if (s) printf("SHR EDX, %lu\n", s);
      printf("\n");
      printf("; quotient now in EDX\n");
   }
   else {
      printf("; dividend: register other than EAX or memory location\n");
      printf("\n");
      printf("MOV EAX, 0%08lXh\n", m);
      printf("MUL dividend\n");
      if (s) printf("SHR EDX, %lu\n", s);
      printf("\n");
      printf("; quotient now in EDX\n");
   }

printed_code:

   fprintf(stderr, "\n");
   exit(0);

   return(0);
}
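
The two instruction sequences that the program prints can be mirrored in portable C to check them against the ordinary division operator. The sketch below assumes the constants obtained by tracing the program by hand: a divisor of 10 selects algorithm 0 with m = 0CCCCCCCDh and s = 3, and a divisor of 7 selects algorithm 1 with m = 049249249h and s = 1. The function names udiv10 and udiv7 are hypothetical and are not part of the utility.

```c
#include <stdint.h>

/* Algorithm 0 for d = 10: MOV EAX, m / MUL dividend / SHR EDX, s. */
uint32_t udiv10(uint32_t x)
{
    uint64_t product = (uint64_t)x * 0xCCCCCCCDu;  /* EDX:EAX = m * dividend */
    return (uint32_t)(product >> 32) >> 3;         /* SHR EDX, 3 */
}

/* Algorithm 1 for d = 7: m is added to the 64-bit product before the shift
   (ADD EAX, m / ADC EDX, 0), then SHR EDX, 1. */
uint32_t udiv7(uint32_t x)
{
    const uint64_t m = 0x49249249u;
    uint64_t product = (uint64_t)x * m + m;        /* cannot overflow 64 bits */
    return (uint32_t)(product >> 32) >> 1;
}
```

Comparing these functions with x / 10 and x / 7 over a range of inputs is a quick way to gain confidence in a generated multiplier and shift factor before using the assembly sequence.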

Signed Integer Division

The utility sdiv.exe was compiled using the following code. The utilities provided in this document
are for reference only and are not supported by AMD.

Example Code

/* This program determines the algorithm (a), multiplier (m), and
   shift factor (s) to be used to accomplish *signed* division by
   a constant divisor. Compile with MSVC.
*/

#include <stdio.h>
#include <stdlib.h>   /* labs(), exit() */

typedef unsigned __int64 U64;
typedef unsigned long    U32;

/* Returns floor(log2(i)) for i > 0. */
U32 log2(U32 i)
{
   U32 t = 0;
   i = i >> 1;
   while (i) {
      i = i >> 1;
      t++;
   }
   return(t);
}

long e;
U32 res1, res2;
U32 oa, os, om;
U32 d, l, s, m, a, r, t;
U64 m_low, m_high, j, k;

int main(void)
{
   fprintf(stderr, "\n");
   fprintf(stderr, "Signed division by constant\n");
   fprintf(stderr, "===========================\n\n");

   fprintf(stderr, "enter divisor: ");
   scanf("%ld", &e);               /* e keeps the signed divisor */
   fprintf(stderr, "\n");

   d = (U32)labs(e);               /* d is the magnitude of the divisor */

   if (d == 0) goto printed_code;

   if (e == (-1)) {
      printf("; dividend: register or memory location\n");
      printf("\n");
      printf("NEG dividend\n");
      printf("\n");
      printf("; quotient replaced dividend\n");
      goto printed_code;
   }

   if (d == 2) {
      printf("; dividend expected in EAX\n");
      printf("\n");
      printf("CMP EAX, 080000000h\n");
      printf("SBB EAX, -1\n");
      printf("SAR EAX, 1\n");
      if (e < 0) printf("NEG EAX\n");
      printf("\n");
      printf("; quotient now in EAX\n");
      goto printed_code;
   }

   /* Divisor magnitude is a power of two. */
   if (!(d & (d - 1))) {
      printf("; dividend expected in EAX\n");
      printf("\n");
      printf("CDQ\n");
      printf("AND EDX, 0%08lXh\n", (d - 1));
      printf("ADD EAX, EDX\n");
      if (log2(d)) printf("SAR EAX, %lu\n", log2(d));
      if (e < 0) printf("NEG EAX\n");
      printf("\n");
      printf("; quotient now in EAX\n");
      goto printed_code;
   }

   /* Determine algorithm (a), multiplier (m), and shift factor (s) for 32-bit
      signed integer division. Based on: Granlund, T.; Montgomery, P.L.:
      "Division by Invariant Integers using Multiplication". SIGPLAN Notices,
      Vol. 29, June 1994, page 61.
   */
   l = log2(d);
   j = (((U64)(0x80000000)) % ((U64)(d)));
   k = (((U64)(1)) << (32 + l)) / ((U64)(0x80000000 - j));
   m_low  = (((U64)(1)) << (32 + l)) / d;
   m_high = ((((U64)(1)) << (32 + l)) + k) / d;
   while (((m_low >> 1) < (m_high >> 1)) && (l > 0)) {
      m_low  = m_low >> 1;
      m_high = m_high >> 1;
      l = l - 1;
   }
   m = ((U32)(m_high));
   s = l;
   a = (m_high >> 31) ? 1 : 0;

   if (a) {
      printf("; dividend: memory location or register other than EAX or EDX\n");
      printf("\n");
      printf("MOV EAX, 0%08lXh\n", m);
      printf("IMUL dividend\n");
      printf("MOV EAX, dividend\n");
      printf("ADD EDX, EAX\n");
      if (s) printf("SAR EDX, %lu\n", s);
      printf("SHR EAX, 31\n");
      printf("ADD EDX, EAX\n");
      if (e < 0) printf("NEG EDX\n");
      printf("\n");
      printf("; quotient now in EDX\n");
   }
   else {
      printf("; dividend: memory location or register other than EAX or EDX\n");
      printf("\n");
      printf("MOV EAX, 0%08lXh\n", m);
      printf("IMUL dividend\n");
      printf("MOV EAX, dividend\n");
      if (s) printf("SAR EDX, %lu\n", s);
      printf("SHR EAX, 31\n");
      printf("ADD EDX, EAX\n");
      if (e < 0) printf("NEG EDX\n");
      printf("\n");
      printf("; quotient now in EDX\n");
   }

printed_code:

   fprintf(stderr, "\n");
   exit(0);
}
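
The signed sequences can be checked in the same way. The following sketch assumes the constants obtained by tracing the program by hand: a divisor of 3 gives a = 0 with m = 055555556h and s = 0, and a divisor of 7 gives a = 1 with m = 092492493h and s = 2. The names sdiv3 and sdiv7 are hypothetical. The code also assumes that the C compiler implements right shifts of negative values as arithmetic shifts, which matches SAR but is implementation-defined in C.

```c
#include <stdint.h>

/* a = 0 path for d = 3: IMUL, then add the sign bit of the dividend. */
int32_t sdiv3(int32_t x)
{
    int64_t product = (int64_t)x * 0x55555556;     /* IMUL dividend */
    int32_t hi = (int32_t)(product >> 32);         /* EDX (arithmetic shift assumed) */
    return hi + (int32_t)((uint32_t)x >> 31);      /* SHR EAX, 31 / ADD EDX, EAX */
}

/* a = 1 path for d = 7: the multiplier is negative as a signed 32-bit
   value, so the dividend is added back to EDX before the shift. */
int32_t sdiv7(int32_t x)
{
    int64_t product = (int64_t)x * -1840700269LL;  /* 092492493h as signed 32-bit */
    int32_t hi = (int32_t)(product >> 32);         /* EDX */
    int32_t t = (int32_t)((int64_t)hi + x);        /* ADD EDX, dividend */
    t >>= 2;                                       /* SAR EDX, 2 */
    return t + (int32_t)((uint32_t)x >> 31);       /* SHR EAX, 31 / ADD EDX, EAX */
}
```

Both functions truncate toward zero, as IDIV does, so they can be compared directly with x / 3 and x / 7.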
8.9 Optimizing Integer Division

Optimization 

When possible, use smaller data types for integer division. 

Application 

This optimization applies to: 

• 32-bit software 

• 64-bit software 

Rationale 

Division by a 16-bit value is significantly faster than division by a 32-bit value: the latency is
about 26 clocks versus about 42. Likewise, division by a 32-bit value is faster than division by a
64-bit value: about 42 clocks versus about 74. Refer to IDIV in table 15. In algorithms in which
integer division accounts for a substantial share of the execution time, check whether a smaller
divide type can be used. Study the assembly-language output generated by high-level language
compilers to verify that the desired code is generated. Compilers often generate code that converts
16-bit types into 32-bit values that are then used to perform 32-bit division, thus eliminating the
advantage of using 16-bit integer types. If the compiler cannot be coerced into producing the desired
code, use compiler intrinsics or assembly language.
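
One way to apply this in C is to test at run time whether both operands of a 64-bit division fit in 32 bits and, if so, divide in the narrower type. The following is a minimal sketch under that assumption; the function name div_u64_narrow is hypothetical, the caller must guarantee a nonzero divisor, and whether the extra test pays off depends on how often the operands are small. As noted above, inspect the generated assembly to confirm that a 32-bit DIV is actually emitted.

```c
#include <stdint.h>

/* Divides n by d (d != 0), using a 32-bit division when both values fit. */
uint64_t div_u64_narrow(uint64_t n, uint64_t d)
{
    if (((n | d) >> 32) == 0) {
        /* Both operands fit in 32 bits: the narrower divide is sufficient. */
        return (uint32_t)n / (uint32_t)d;
    }
    return n / d;  /* general 64-bit case */
}
```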
Chapter 9 Optimizing with SIMD Instructions


The 64-bit and 128-bit SIMD instructions (SSE and SSE2 instructions) should be used to encode
floating-point and integer operations.

• The SIMD instructions use a flat register file rather than the stack register file used by x87 
floating-point instructions. This allows arbitrary sequences of operations to map more efficiently 
to the instruction set. 

• Future processors with more or wider multipliers and adders will achieve better throughput using 
SSE and SSE2 instructions. (Today’s processors implement a 128-bit-wide SSE or SSE2 
operation as two 64-bit operations that are internally pipelined.) 

• SSE and SSE2 instructions work well in both 32-bit and 64-bit threads. 

The SIMD instructions provide a theoretical single-precision peak throughput of two additions and 
two multiplications per clock cycle, whereas x87 instructions can only sustain one addition and one 
multiplication per clock cycle. The SSE2 and x87 double-precision peak throughput is the same, but 
SSE2 instructions provide better code density. 

This chapter covers the following topics: 


Topic Page

Ensure All Packed Floating-Point Data are Aligned 195

Improving Scalar SSE and SSE2 Floating-Point Performance with MOVLPD and MOVLPS When Loading Data from Memory 196

Structuring Code with Prefetch Instructions to Hide Memory Latency 200

Avoid Moving Data Directly Between General-Purpose and MMX™ Registers 206

Use MMX™ Instructions to Construct Fast Block-Copy Routines in 32-Bit Mode 207

Passing Data between MMX™ and 3DNow!™ Instructions 208

Storing Floating-Point Data in MMX™ Registers 209

EMMS and FEMMS Usage 210

Using SIMD Instructions for Fast Square Roots and Fast Reciprocal Square Roots 211

Use XOR Operations to Negate Operands of SSE, SSE2, and 3DNow!™ Instructions 215

Clearing MMX™ and XMM Registers with XOR Instructions 216

Finding the Floating-Point Absolute Value of Operands of SSE, SSE2, and 3DNow!™ Instructions 217

Accumulating Single-Precision Floating-Point Numbers Using SSE, SSE2, and 3DNow!™ Instructions 218

Complex-Number Arithmetic Using SSE, SSE2, and 3DNow!™ Instructions 221

Optimized 4x4 Matrix Multiplication on 4 x 1 Column Vector Routines 230
9.1 Ensure All Packed Floating-Point Data are Aligned

Optimization 

Align all packed floating-point data on 16-byte boundaries. 

Application 

This optimization applies to: 

• 32-bit software 

• 64-bit software 

Rationale 

Misaligned memory accesses reduce the available memory bandwidth, and SSE and SSE2 instructions
have shorter latencies when operating on aligned memory operands.

Aligning data on 16-byte boundaries allows you to use the aligned load instructions (MOVAPS, 
MOVAPD, and MOVDQA), which move through the floating-point unit with shorter latencies and 
reduce the possibility of stalling addition or multiplication instructions that are dependent on the load 
data. 
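
In C source code, 16-byte alignment has to be requested explicitly for both static and dynamically allocated data. The following is a minimal sketch that assumes a C11 compiler (the _Alignas specifier and aligned_alloc are C11 features and may be unavailable in older toolchains, where a compiler-specific equivalent is needed); the names coeffs, aligned_coeffs, is_aligned16, and alloc_floats16 are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Static data: request 16-byte alignment with the C11 alignment specifier. */
static _Alignas(16) float coeffs[4] = {1.0f, 2.0f, 3.0f, 4.0f};

const float *aligned_coeffs(void)
{
    return coeffs;
}

/* Returns nonzero if p lies on a 16-byte boundary. */
int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Dynamic data: aligned_alloc expects a size that is a multiple of the
   alignment, so the byte count is rounded up to the next multiple of 16.
   Release the block with free(). */
float *alloc_floats16(size_t count)
{
    size_t bytes = (count * sizeof(float) + 15u) & ~(size_t)15u;
    return (float *)aligned_alloc(16, bytes);
}
```

A check such as is_aligned16 is useful in debug builds to catch buffers that would fault or run slowly when passed to code that uses aligned loads.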
9.2 Improving Scalar SSE and SSE2 Floating-Point Performance with MOVLPD and MOVLPS When Loading Data from Memory

Optimization 

Use the MOVLPS and MOVLPD instructions to move scalar floating-point data into the XMM 
registers prior to addition, multiplication, or other scalar instructions. 

Application 

This optimization applies to: 

• 32-bit software 

• 64-bit software 

Rationale—Single Precision 

The MOVSS instruction is used to move scalar single-precision floating-point data into the XMM 
registers prior to addition (ADDSS) and multiplication (MULSS) or other scalar instructions. In 
addition to loading a 32-bit floating-point value into the XMM register, the MOVSS instruction clears 
the upper 96 bits of the register. Clearing part of the XMM register is an inefficiency that you can 
bypass by using the MOVLPS instruction. MOVLPS loads two floating-point values from memory 
without clearing the upper 64 bits of the XMM register. 

The latency of the MOVSS instruction is 3 cycles, whereas the latency of the MOVLPS instruction is 
2 cycles. The AMD Athlon™ 64 and AMD Opteron™ processors can perform two 64-bit loads per 
clock cycle. Two 64-bit MOVLPS loads can be issued in the same cycle, assuming the data is 8-byte 
aligned. Likewise, two MOVSS loads can be performed per cycle, but—unlike MOVLPS—additional 
operations that interfere with the MULSS and ADDSS instructions must be issued to clear the 
register. Using MOVLPS rather than MOVSS to load single-precision scalar data from memory on 
processor-limited floating-point-intensive code can result in significant performance increases. 

Consider the following caveats when using the MOVLPS instruction: 

• When accessing 4-byte-aligned addresses that are not 8-byte aligned, MOVLPS loads take an 
additional cycle. 

• Since MOVLPS loads two floating-point values instead of one, accessing the last floating-point 
value in a single-precision array attempts to load 4 bytes of additional memory directly after the 
end of the array, which may cause an access violation. To avoid an access violation, use MOVSS 
to access the last value in a single-precision array or store a dummy floating-point value at the end 
of the array. 
• The statement movlps xmm1, mem64 marks the lower half of XMM1 as FPS (floating-point
single-precision) but leaves the upper half of XMM1 unchanged. If XMM1 is later used in any
instruction that uses the full 128 bits of XMM1, there can be a performance penalty if the top half
is not also in FPS format. Examples of instructions that expect the full 128 bits of XMM1 to be in
FPS format are MOVAPS, ANDPS, ANDNPS, and ORPS. For more information on XMM-register
data types, see “Half-Register Operations” on page 356.

Rationale—Double Precision

The MOVLPD instruction does not necessitate clearing the upper 64 bits of an XMM register, as the
MOVSD/MOVQ instructions do, upon loading 64 bits of floating-point data into the lower 64 bits of
the XMM register. Using the MOVLPD instruction can significantly increase performance on
processor-limited SSE2 scalar floating-point-intensive code.

Consider the following caveat when using the MOVLPD instruction:

• The statement movlpd xmm1, mem64 marks the lower half of XMM1 as FPD (floating-point
double-precision) but leaves the upper half of XMM1 unchanged. If XMM1 is later used in any
instruction that uses the full 128 bits of XMM1, there can be a performance penalty if the top half
is not also in FPD format. Examples of instructions that expect the full 128 bits of XMM1 to be in
FPD format are ANDPD, ANDNPD, and ORPD. For more information on XMM-register data
types, see “Half-Register Operations” on page 356.
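
The single-precision caveat about the end of an array can be illustrated with compiler intrinsics. The sketch below assumes the SSE intrinsics from xmmintrin.h, where _mm_loadl_pi corresponds to MOVLPS and _mm_load_ss to MOVSS; the function name sum_floats is hypothetical, and a compiler is free to select different instructions, so the generated code should be inspected. Every element except the last is loaded with the 64-bit load, and the last element is loaded with MOVSS so that no bytes beyond the array are read.

```c
#include <stddef.h>
#include <xmmintrin.h>

/* Sums n floats with scalar SSE additions. */
float sum_floats(const float *a, size_t n)
{
    __m128 acc = _mm_setzero_ps();
    __m128 v = _mm_setzero_ps();
    size_t i;

    if (n == 0)
        return 0.0f;

    for (i = 0; i + 1 < n; i++) {
        /* 64-bit load of a[i] and a[i+1]; only the low element is used. */
        v = _mm_loadl_pi(v, (const __m64 *)(a + i));
        acc = _mm_add_ss(acc, v);
    }

    /* Last element: a 32-bit load avoids touching memory past the array. */
    v = _mm_load_ss(a + n - 1);
    acc = _mm_add_ss(acc, v);

    return _mm_cvtss_f32(acc);
}
```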
9.3 Use MOVLPx/MOVHPx Instructions for Unaligned Data Access

Optimization

When data alignment cannot be guaranteed, use MOVLPD/MOVHPD, MOVLPS/MOVHPS, or
MOVQ/MOVHPD pairs in lieu of MOVUPD, MOVUPS, or MOVDQU, respectively.

Application 

This optimization applies to: 

• 32-bit software 

• 64-bit software 

Rationale 

The MOVUPS, MOVUPD, and MOVDQU instructions are VectorPath when one of the operands is a
memory location. It is better to use one of the MOVLPx/MOVHPx or MOVQ/MOVHPD pairs. When
the memory location cannot be guaranteed to be aligned, it is preferable to load or store the 64-bit
halves of an XMM register separately.
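
With compiler intrinsics, the split access can be written as two 64-bit operations. The following sketch assumes the SSE2 intrinsics from emmintrin.h, where _mm_loadl_pd and _mm_loadh_pd correspond to MOVLPD and MOVHPD loads and _mm_storel_pd and _mm_storeh_pd to the matching stores; the function names are hypothetical, and an optimizing compiler may merge the pair into a single unaligned access, so check the generated code.

```c
#include <emmintrin.h>

/* Loads two doubles from a possibly unaligned address as two 64-bit halves. */
__m128d load2_unaligned(const double *p)
{
    __m128d v = _mm_setzero_pd();
    v = _mm_loadl_pd(v, p);        /* low half  (MOVLPD) */
    v = _mm_loadh_pd(v, p + 1);    /* high half (MOVHPD) */
    return v;
}

/* Stores two doubles to a possibly unaligned address as two 64-bit halves. */
void store2_unaligned(double *p, __m128d v)
{
    _mm_storel_pd(p, v);           /* low half  */
    _mm_storeh_pd(p + 1, v);       /* high half */
}
```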
9.4 Use MOVAPD and MOVAPS Instead of MOVUPD and MOVUPS

Optimization 

For best performance use the aligned versions of these instructions when using a memory operand. 

Application 

This optimization applies to: 

• 32-bit software 

• 64-bit software 

Rationale 

Both MOVUPS and MOVUPD are VectorPath instructions when one of the operands is a memory 
location. It is better to use MOVAPS and MOVAPD since they are both DirectPath Double decode 
types. Misaligned memory accesses also reduce the available memory bandwidth and SSE and SSE2 
instructions have shorter latencies when operating on aligned memory operands. Aligning data on 16- 
byte boundaries allows you to use the aligned load instructions (MOVAPS, MOVAPD, and 
MOVDQA), which move through the floating-point unit with shorter latencies and reduce the 
possibility of stalling addition or multiplication instructions that are dependent on the load data. 
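
When the data is known to be 16-byte aligned, the aligned load can be requested directly. The sketch below assumes the SSE2 intrinsics from emmintrin.h, where _mm_load_pd corresponds to an aligned MOVAPD load; the function name sum_aligned_pd is hypothetical. The caller is responsible for the alignment, because an aligned load from a misaligned address raises an exception.

```c
#include <emmintrin.h>
#include <stddef.h>

/* Sums n doubles. The array must be 16-byte aligned and n must be even. */
double sum_aligned_pd(const double *a, size_t n)
{
    __m128d acc = _mm_setzero_pd();
    double lanes[2];
    size_t i;

    for (i = 0; i < n; i += 2)
        acc = _mm_add_pd(acc, _mm_load_pd(a + i));  /* aligned 128-bit load */

    _mm_storeu_pd(lanes, acc);   /* lanes[] itself need not be aligned */
    return lanes[0] + lanes[1];
}
```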
9.5 Structuring Code with Prefetch Instructions to Hide Memory Latency

Optimization 

When utilizing prefetch instructions, attend to: 

• The time allotted (latency) for data to reach the processor between issuing a prefetch instruction 
and using the data. 

• Structuring the code to best take advantage of prefetching. 

Application 

This optimization applies to: 

• 32-bit software 

• 64-bit software 

Rationale 

Prefetch instructions bring the cache line containing a specified memory location into the processor
cache. (For more information on prefetch instructions, see “Prefetch Instructions” on page 104.)
Prefetching hides the main memory load latency, which is typically on the order of a hundred times
larger than a processor clock cycle (for example, 60 ns versus 0.5 ns on a 2-GHz processor).

There are two types of loops:

Loop type          Description

Memory-limited     Data can be processed and requested faster than it can be fetched from memory.

Processor-limited  Data can be requested and brought into the processor before it is needed because
                   considerable processing occurs during each unrolled loop iteration.


The example provided below illustrates the importance of the above considerations. It multiplies a
double-precision 32 x 32 matrix A with another 32 x 32 transposed double-precision matrix, B^T;
the result is returned in another 32 x 32 transposed double-precision matrix, C^T. (The
transposition of B and C is performed to efficiently access their elements because matrices in the C
programming language are stored in row-major format. Doing the transposition in advance reduces
the problem of matrix multiplication to one of computing several dot-products, one for each element
of the results matrix, C^T. This “dotting” operation is implemented as the sum of pair-wise products
of the elements of two equal-length vectors.) For this example, assume the processor clock speed is
2 GHz, and the memory latency is 60 ns. In this example, the rows of matrix A are repeatedly
“dotted” with a column of B^T. Once this is done, the rows of matrix A are “dotted” with the next
column of B^T, and the process is repeated through all the columns of B^T.

From a performance standpoint, there are several caveats to recognize, as follows:

• Once all the rows of A have been multiplied with the first column of B, all the rows of A are in the
cache, and subsequent accesses to them do not cause cache misses.

• The rows of B^T are brought into the cache by “dotting” the first four rows of A with each row of
B^T in the Ctr_row_num for-loop.

• The elements of C^T are not initially in the cache, and every time a new set of four rows of A are
“dotted” with a new row of B^T, the processor has to wait for C^T to arrive in the cache before the
results can be written.

You can address the last two caveats by prefetching to improve performance. However, to efficiently 
exploit prefetching, you must structure the code to issue the prefetch instructions such that: 

• Enough time is provided for memory requests sent out through prefetch requests to bring data into 
the processor’s cache before the data is needed. 

• The loops containing the prefetch instructions are ordered to issue sufficient prefetch instructions 
to fetch all the pertinent data. 

The matrix order of 32 is not a coincidence. A double-precision number consists of 8 bytes. Prefetch
instructions bring memory into the processor in chunks called cache lines consisting of 64 bytes (or
eight double-precision numbers). We need to issue four prefetch instructions to prefetch a row of B^T.
Consequently, when multiplying all 32 rows of A with a particular column of B, we want to arrange
the for-loop that cycles through the rows of A such that it is repeated four times. To achieve this, we
need to dot eight rows of A with a row of B^T every time we pass through the Ctr_row_num for-loop.
Additionally, “dotting” eight rows of A upon a row of B^T produces eight doubles of C^T (that is, a
full cache line).

Assume it takes 60 ns to retrieve data from memory; then we must ensure that at least this much time
elapses between issuing the prefetch instruction and the processor loading that data into its registers.
The dot-product of eight rows of A with a row of B^T consists of 512 floating-point operations
(dotting a single row of A with a row of B^T consists of 32 additions and 32 multiplications). The
AMD Athlon, AMD Athlon 64, and AMD Opteron processors are capable of performing a maximum
of two floating-point operations per clock cycle; therefore, it takes the processor no less than
256 clock cycles to process each Ctr_row_num for-loop.

Choosing a matrix order of 32 is convenient for these reasons:

• All three matrices A, B^T, and C^T can fit into the processor’s 64-Kbyte L1 data cache.

• On a 2-GHz processor running at full floating-point utilization, 128 ns elapse during the
256 clock cycles, considerably more than the 60 ns to retrieve the data from memory.

• The size of each row is an integer number of cache lines.
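
The timing and capacity figures quoted above follow from simple arithmetic, which the following C sketch reproduces. The helper names are hypothetical; the inputs (eight rows, matrix order 32, two floating-point operations per clock, a 2-GHz clock) are the values assumed in this section.

```c
/* Floating-point operations needed to dot `rows` rows of A with one row of
   B-transpose: each dot-product takes `order` multiplications and `order`
   additions. */
unsigned block_flops(unsigned rows, unsigned order)
{
    return rows * 2u * order;
}

/* Lower bound on clock cycles at a given peak rate of operations per clock. */
unsigned min_cycles(unsigned flops, unsigned flops_per_clock)
{
    return flops / flops_per_clock;
}

/* Converts clock cycles to nanoseconds for a clock frequency given in GHz. */
double cycles_to_ns(unsigned cycles, double clock_ghz)
{
    return cycles / clock_ghz;
}

/* Storage in bytes for one square matrix of doubles. */
unsigned long matrix_bytes(unsigned order)
{
    return (unsigned long)order * order * sizeof(double);
}
```

With these inputs the block needs 512 operations, at least 256 clocks, and therefore 128 ns at 2 GHz, while the three matrices occupy 3 x 8192 = 24,576 bytes, comfortably below the 64-Kbyte cache size.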

A set of eight rows of A is dotted in two groups of four with B^T, and prefetches in each iteration of
the Ctr_row_num for-loop are issued to retrieve:

• The cache line (or set of eight double-precision values) of C to be processed in the next iteration
of the Ctr_row_num for-loop.

• One quarter of the next row of B^T.

Including the prefetch to the rows of B^T increases performance by about 16%. Prefetching the
elements of C increases performance by an additional 3% or so.

Follow these guidelines when working with processor-limited loops:

• Arrange your code with enough instructions between prefetches so that there is adequate time for
the data to be retrieved.

• Make sure the data that you are prefetching fits into the L1 data cache and does not displace other
data that is also being operated upon. For instance, choosing a larger matrix size might displace A
if all three matrices cannot fit into the 64-Kbyte L1 data cache.

• Operate on data in chunks that are integer multiples of cache lines.

Examples 

Double-Precision 32 x 32 Matrix Multiplication 

//****************-k*-k-k-k*-k*-k-k***-k*-k*-k*-k*-k*-k*-k*-k-k-k-k-k-k-k-k-k-k-k-k-k-k-k-k-k-k-k-k************** 

// This routine multiplies a 32x32 matrix A (stored in row-major format) upon 
// the transpose of a 32x32 matrix B (stored in row-major format) to get 
// the transpose of the resultant 32x32 matrix C. 

/I******************************************************************************* 
void matrix_multiply_32x32(double *A,double *Btranspose,double *Ctranspose) { 
int Ctr_8col_blck, Ctr_row_num, n; 

// These 4 pointers are used to address 4 consecutive rows of matrix A. 
double *AptrO, *Aptrl, *Aptr2, *Aptr3; 

// Pointers *Btr_ptr and *Ctr_ptr are used to address the column of B upon 
// which A is being multiplied and where the result C is placed. 

// Pointers *Bprefptr and *Cprefptr are used to address the next column 
// of B and the next elements of C to be calculated in advance 
// using prefetch instructions. 

double *Btr_ptr, *Ctr_ptr, *Btr_prefptr, *Ctr_prefptr; 

// Put the address of matrices B-tranpose and C-transpose into their 
// respective temporary pointers. 

Btr_ptr = Btranspose; Ctr_ptr = Ctranspose; 

// Shift the prefetch pointers to the next row of B-transpose and the 
// next set of 8 elements of C-transpose. (Each set of 8 doubles is 
// a 64-byte cache line if the addresses Btr_ptr and Ctr_ptr are aligned 
// in memory on 64-byte boundaries.) 
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Btr_prefptr = Btr_ptr + 32; Ctr_prefptr = Ctr_ptr + 8; 

// This loop cycles through the rows of the TRANSPOSED C matrix. A row 
// of C-transpose is calculated by the code in this loop and then the 
// next row is determined in the following loop iteration. There are 
// 32 rows in C-transpose. 

for (Ctr_row_num = 0; Ctr_row_num < 32; Ctr_row_num++) { 

// Assign pointers to 4 consecutive rows of A by using the 
// address of matrix A passed into the function: 

AptrO = A; 

Aptrl = AptrO + 32; 

Aptr2 = AptrO + 64; 

Aptr3 = AptrO + 96; 

// This loop contains code that "dots" 8 rows of A upon the present row 
// of B-transpose. By looping 4 times, all 32 rows of A are multiplied 
// upon the present column of B-transpose. 

for (Ctr_8col_blck = 0; Ctr_8col_blck < 4; Ctr_8col_blck++) { 

// This instruction prefetches 1/4 of the next column of B-transpose 
// upon which matrix A needs to be multiplied. The loop within which 
// this code resides is executed 4 times, and by incrementing 
// Btr_prefptr (the ptr to the address of B transpose to be 
// prefetched) by 8 doubles (or 64 bytes, or 1 cache line) the entire 

// contents of the next row of B-transpose are brought to the 

// processor in advance when Ctr_row_num in the outer loop is 
// incremented 

_mm_prefetch(&Btr_prefptr[0] , 2) ; 

// This loop below "dots" 4 consecutive rows of A upon a row of 
// B-transpose by looping 8 times through code that multiplies and 

// accumulates the products of 4 elements of A's rows with 4 

// elements of B-transpose 1 s column, 
for (n = 0; n < 8; n++) { 

Ctr_ptr[0] += AptrO[0]*Btr_ptr[0] + AptrO[1]*Btr_ptr[1] + 

AptrO[2]*Btr_ptr[2] + AptrO[3]*Btr_ptr[3]; 

Ctr_ptr[l] += Aptrl[0]*Btr_ptr[0] + Aptrl[1]*Btr_ptr[1] + 

Aptrl[2]*Btr_ptr[2] + Aptrl[3]*Btr_ptr[3]; 

Ctr_ptr[2] += Aptr2[0]*Btr_ptr[0] + Aptr2[1]*Btr_ptr[1] + 

Aptr2[2]*Btr_ptr[2] + Aptr2[3]*Btr_ptr[3]; 

Ctr_ptr[3] += Aptr3[0]*Btr_ptr[0] + Aptr3[1]*Btr_ptr[1] + 

Aptr3[2]*Btr_ptr[2] + Aptr3[3]*Btr_ptr[3]; 

// Increment pointers to B transpose's column and A's rows to 
// the next 4 elements to be multiplied and accumulated. 

Btr_ptr += 4; 

AptrO += 4; 

Aptrl += 4; 

Aptr2 += 4; 

Aptr3 += 4; 

} 

// The pointer to C-transpose is incremented by 4 doubles to 
// address the next 4 elements of C-transpose's row to be determined. 
Ctr_ptr += 4; 

// The pointer to B transpose points to the end of the present 
// row. We need to subtract 32 doubles so Btr_ptr points 
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// again to the top of the column for the next dot-product of 
// 4 rows of A upon B-transpose's row vector. 

Btr_ptr -= 32; 

// The addresses AptrO, Aptrl, Aptr2, and Aptr3 need to be 
// incremented to the next block of 4 rows of A to be multiplied 
// upon B's column. 4 rows of A are 128 doubles in size, and in 
// the n-loop above they were incremented by 32 already, so they 
// must be incremented an additional 96 to point to the next 
// 4 rows of A to be dotted. 

AptrO += 96; 

Aptrl += 96; 

Aptr2 += 96; 

Aptr3 += 96; 

_mm_prefetch(&Ctr_prefptr[0], 2); 

// This loop below "dots" 4 consecutive rows of A upon a row
// of B-transpose by looping 8 times through code that
// multiplies and accumulates the products of 4 elements of A's
// rows with 4 elements of B-transpose's column.
for (n = 0; n < 8; n++) {
    Ctr_ptr[0] += Aptr0[0]*Btr_ptr[0] + Aptr0[1]*Btr_ptr[1] +
        Aptr0[2]*Btr_ptr[2] + Aptr0[3]*Btr_ptr[3];
    Ctr_ptr[1] += Aptr1[0]*Btr_ptr[0] + Aptr1[1]*Btr_ptr[1] +
        Aptr1[2]*Btr_ptr[2] + Aptr1[3]*Btr_ptr[3];
    Ctr_ptr[2] += Aptr2[0]*Btr_ptr[0] + Aptr2[1]*Btr_ptr[1] +
        Aptr2[2]*Btr_ptr[2] + Aptr2[3]*Btr_ptr[3];
    Ctr_ptr[3] += Aptr3[0]*Btr_ptr[0] + Aptr3[1]*Btr_ptr[1] +
        Aptr3[2]*Btr_ptr[2] + Aptr3[3]*Btr_ptr[3];

    // Increment pointers to B transpose's column and A's rows to
    // the next 4 elements to be multiplied and accumulated.
    Btr_ptr += 4;
    Aptr0 += 4;
    Aptr1 += 4;
    Aptr2 += 4;
    Aptr3 += 4;
}

// The addresses to prefetch in B-transpose and C-transpose
// are incremented by 8 doubles, or 64 bytes, or 1 cache line.
// Each of the 4 iterations of the Ctr_8col_blck loop above brings
// in a new set of 8 doubles; after 4 iterations, the full next
// column of B and the next set of 8 elements of C to be
// determined have also been brought into the cache.
Btr_prefptr += 8;
Ctr_prefptr += 8;

// The pointer to C-transpose is incremented by 4 doubles
// to address the next 4 elements of C-transpose's row to be
// determined.
Ctr_ptr += 4;

// The pointer to B-transpose points to the end of the present
// row. We need to subtract 32 doubles so Btr_ptr points again
// to the top of the column for the next dot-product of 4 rows of A
// upon B-transpose's row vector.
Btr_ptr -= 32;

// The addresses Aptr0, Aptr1, Aptr2, and Aptr3 need to be
// incremented to the next block of 4 rows of A to be dotted
// upon B's column. 4 rows of A are 128 doubles in size, and
// in the n-loop above they were incremented by 32 already, so they
// must be incremented an additional 96 to point to the
// next 4 rows of A to be dotted.
Aptr0 += 96;
Aptr1 += 96;
Aptr2 += 96;
Aptr3 += 96;

} 

// Pointer to B-transpose is incremented by a row so as to point 
// to the next row of B upon which matrix A needs to be multiplied. 
Btr_ptr += 32; 

} 

}

9.6 Avoid Moving Data Directly Between General-Purpose and MMX™ Registers

Optimization 

Avoid moving data directly between general-purpose registers and MMX™ registers; this operation 
requires the use of the MOVD instruction. If it is absolutely necessary to move data between these 
two types of registers, use separate store and load instructions to move the data from the source 
register to a temporary location in memory and then from memory into the destination register, 
separating the store and the load by at least 10 instructions. 

Application 

This optimization applies to: 

• 32-bit software 

• 64-bit software 

Rationale 

The register-to-register forms of the MOVD instruction are either VectorPath or DirectPath Double
instructions. When compared with DirectPath Single instructions, VectorPath and DirectPath Double
instructions have comparatively longer execution latencies. In addition, VectorPath instructions
prevent the processor from simultaneously decoding other instructions.

Example 

Avoid code like this, which copies a value directly from an MMX register to a general-purpose 
register: 

movd eax, mm2 

If it is absolutely necessary to copy a value from an MMX register to a general-purpose register (or 
vice versa), use separate store and load instructions, separating them by at least 10 instructions: 

movd DWORD PTR temp, mm2 ; Store the value in memory. 

; At least 10 other instructions appear here. 

mov eax, DWORD PTR temp ; Load the value from memory.

9.7 Use MMX™ Instructions to Construct Fast Block-Copy Routines in 32-Bit Mode

Optimization 

Use MMX instructions when moving integer data in a block-copy routine. 

Application 

This optimization applies to: 

• 32-bit software 

Rationale 

MMX instructions relieve the high register pressure typical of x86 code because of the small register 
file. 

In addition, MMX instructions increase the available parallelism on AMD Athlon 64 and
AMD Opteron processors because they use both sides (integer and floating-point) of the execution
pipeline. For an example of how to move a large quadword-aligned block of data using the MMX
MOVQ instruction, see "Optimizing Main Memory Performance for Large Arrays" in the
AMD Athlon™ Processor x86 Code Optimization Guide (order# 22007).

If a block-copy routine is not used, do not move integer data through MMX registers.

9.8 Passing Data between MMX™ and 3DNow!™ Instructions

Optimization 

Avoid passing data between MMX and 3DNow!™ instructions. 

Application 

This optimization applies to: 

• 32-bit software 

• 64-bit software 

Rationale

The AMD Athlon 64 and AMD Opteron processors do not support bypassing register data between
MMX and 3DNow! instructions. One additional cycle of latency is added to a dependency chain
whenever data is passed between these instruction groups in either direction.

9.9 Storing Floating-Point Data in MMX™ Registers

Optimization 

Avoid storing floating-point data in MMX registers unless using 3DNow! instructions. 

Application 

This optimization applies to: 

• 32-bit software 

• 64-bit software 

Rationale 

Using MOVDQ2Q or MOVQ2DQ to shuffle integer data between MMX and XMM registers is useful 
to relieve register pressure; however, doing so with floating-point data can impact performance. The 
impact is greater if the floating-point data is denormalized. 



Software Optimization Guide for AMD64 Processors 


25112 Rev. 3.06 September 2005 


9.10 EMMS and FEMMS Usage 

Optimization 

Use FEMMS or EMMS to clean up the register file between an x87 instruction and a following 
MMX, 3DNow!, or Enhanced 3DNow! instruction or vice versa. 

Application 

This optimization applies to: 

• 32-bit software 

• 64-bit software 

Rationale 

Use either the FEMMS or the EMMS instruction when switching between the x87 floating-point unit 
and MMX, 3DNow!, or Enhanced 3DNow! instructions. The FEMMS instruction is aliased to the 
EMMS instruction on AMD Athlon 64 and AMD Opteron processors. Both instructions convert to an 
internal NOP instruction in AMD Athlon 64 and AMD Opteron processors. The FEMMS instruction 
is provided to help ensure that code written for previous generations of AMD processors runs 
correctly. 

There is no penalty for switching between the x87 floating-point instructions and 3DNow! (or MMX)
instructions in the processor. The MMX, 3DNow!, and Enhanced 3DNow! instructions are designed
to be used concurrently; therefore, no delimiting cleanup operations are required when switching
between them. However, x87 and 3DNow!/Enhanced 3DNow!/MMX instructions share the same
architectural registers, so there is no easy way to use them concurrently without cleaning up the
register file in between by using FEMMS or EMMS. For more information, see AMD64 Architecture
Programmer’s Manual Volume 1: Application Programming, order# 24592.

9.11 Using SIMD Instructions for Fast Square Roots and Fast Reciprocal Square Roots

Optimization 

Use SIMD vectorized square root (SQRTPS) and reciprocation (RCPPS) instructions to calculate
square roots and reciprocal square roots of single-precision numbers.

Application 

This optimization applies to: 

• 32-bit software 

• 64-bit software 

Rationale 

SIMD instructions exist for performing vectorized square root and reciprocation of single-precision 
numbers. These operations are often used in multimedia applications and also can be utilized in 
scientific arenas, such as molecular dynamics simulations. 

Example 

The following function highlights the use of both the vectorized reciprocal and square-root SSE 
instructions: 

; reciprocal_sqrt_sse(float *r, float *rcp_sqrt_r, int num_points);
;
; TO ASSEMBLE INTO *.obj DO THE FOLLOWING:
; ml.exe -coff -c reciprocal_sqrt_sse.asm
;
.586
.K3D
.XMM

_TEXT SEGMENT
PUBLIC _reciprocal_sqrt_sse
_reciprocal_sqrt_sse PROC NEAR

; INSTRUCTIONS BELOW SAVE THE REGISTER STATE WITH WHICH THIS ROUTINE WAS
; ENTERED.
; REGISTERS EAX, ECX, EDX ARE CONSIDERED VOLATILE AND ASSUMED TO BE CHANGED
; WHILE THE REGISTERS BELOW MUST BE PRESERVED IF THE USER IS CHANGING THEM
push ebp
mov  ebp, esp

; Parameters passed into routine:
; [ebp+8]  = ->r
; [ebp+12] = ->rcp_sqrt_r
; [ebp+16] = num_points
push ebx
push esi
push edi

; THE FIRST 3 ASM LINES BELOW LOAD THE FUNCTION'S ARGUMENTS INTO GENERAL-PURPOSE
; REGISTERS (GPRS)
; esi = address of "r"'s to calculate the reciprocal square root of
; edi = address of "rcp_sqrt_r"'s to store reciprocal square root to
; ecx = num_points
mov  esi, [ebp+8]             ; ESI = ->r
mov  edi, [ebp+12]            ; EDI = ->rcp_sqrt_r
mov  ecx, [ebp+16]            ; ECX = num_points
mov  edx, ecx                 ; EDX = num_points
mov  eax, ecx                 ; EAX = num_points
shl  edx, 2                   ; EDX = 4*num_points
shr  eax, 4                   ; EAX = num_points/16
add  edi, edx                 ; EDI = -> end of "rcp_sqrt_r"
add  esi, edx                 ; ESI = -> end of "r"
neg  ecx                      ; ECX = -num_points
or   eax, eax                 ; If num_points/16 = 0, then skip the unrolled
                              ; reciprocal square root loop.
jz   skip_recprcl_sqrt_4xloop ; The loop is unrolled by 4 to work
                              ; on 16 floats at a time.

; THIS LOOP RECIPROCATES AND SQUARE ROOTS 16 FLOATING-POINT NUMBERS EACH
; LOOP ITERATION AND WORKS WITH THOSE ELEMENTS OF "r" THAT OCCUPY A
; FULL CACHE LINE

ALIGN 16                        ; Align address of loop to a 16-byte boundary.
reciprocal_sqrt_4xloop:
prefetchnta [esi+4*ecx+256]     ; Prefetch the elements "r" 4 cache lines
                                ; ahead to reciprocate and square root 4 loops
                                ; from now.
movaps  xmm0, [esi+4*ecx]       ; XMM0=[r3,r2,r1,r0]
sqrtps  xmm0, xmm0              ; XMM0=[sqrtr3,sqrtr2,sqrtr1,sqrtr0]
rcpps   xmm0, xmm0              ; XMM0=[1/sqrtr3,1/sqrtr2,1/sqrtr1,1/sqrtr0]
movaps  xmm1, [esi+4*ecx+16]    ; XMM1=[r7,r6,r5,r4]
sqrtps  xmm1, xmm1              ; XMM1=[sqrtr7,sqrtr6,sqrtr5,sqrtr4]
rcpps   xmm1, xmm1              ; XMM1=[1/sqrtr7,1/sqrtr6,1/sqrtr5,1/sqrtr4]
movaps  xmm2, [esi+4*ecx+32]    ; XMM2=[r11,r10,r9,r8]
sqrtps  xmm2, xmm2              ; XMM2=[sqrtr11,sqrtr10,sqrtr9,sqrtr8]
rcpps   xmm2, xmm2              ; XMM2=[1/sqrtr11,1/sqrtr10,1/sqrtr9,1/sqrtr8]
movaps  xmm3, [esi+4*ecx+48]    ; XMM3=[r15,r14,r13,r12]
sqrtps  xmm3, xmm3              ; XMM3=[sqrtr15,sqrtr14,sqrtr13,sqrtr12]
rcpps   xmm3, xmm3              ; XMM3=[1/sqrtr15,1/sqrtr14,1/sqrtr13,1/sqrtr12]
movntps [edi+4*ecx], xmm0       ; Store reciprocal square root to rcp_sqrt_r.
movntps [edi+4*ecx+16], xmm1    ; Store reciprocal square root to rcp_sqrt_r.
movntps [edi+4*ecx+32], xmm2    ; Store reciprocal square root to rcp_sqrt_r.
movntps [edi+4*ecx+48], xmm3    ; Store reciprocal square root to rcp_sqrt_r.
add     ecx, 16                 ; Decrement the # of reciprocal square
                                ; roots to calculate by 16.
dec     eax                     ; Decrement # of 16 float reciprocal square
                                ; root loops to perform by 1.
jnz     reciprocal_sqrt_4xloop
jmp     skip_recprcl_sqrt_4xloop ; Jump into loop to calculate reciprocal
                                ; square root of floats that don't
                                ; occupy a full cache line.


; THIS LOOP RECIPROCATES AND SQUARE ROOTS 1 FLOATING POINT NUMBER EACH
; LOOP ITERATION
ALIGN 16                        ; Align address of loop to a 16-byte boundary.
reciprocal_sqrt_1xloop:
movss   xmm0, [esi+4*ecx]       ; XMM0=[,,,r0]
sqrtss  xmm0, xmm0              ; XMM0=[,,,sqrt(r0)]
rcpss   xmm0, xmm0              ; XMM0=[,,,1/sqrt(r0)]
movss   [edi+4*ecx], xmm0       ; Store reciprocal square root to rcp_sqrt_r.
inc     ecx                     ; Decrement the # of reciprocal square roots
                                ; to calculate.
skip_recprcl_sqrt_4xloop:
or      ecx, ecx
jnz     reciprocal_sqrt_1xloop  ; If ECX != 0, then calculate the reciprocal
                                ; square root of another float.
sfence                          ; Finish all memory writes.

; INSTRUCTIONS BELOW RESTORE THE REGISTER STATE WITH WHICH THIS ROUTINE
; WAS ENTERED.
; REGISTERS EAX, ECX, AND EDX ARE CONSIDERED VOLATILE AND ASSUMED TO BE CHANGED,
; WHILE THE REGISTERS BELOW MUST BE PRESERVED IF THE USER IS CHANGING THEM
pop  edi
pop  esi
pop  ebx
mov  esp, ebp
pop  ebp

ret

_reciprocal_sqrt_sse ENDP
_TEXT ENDS
END

The preceding code illustrates the use of separate loops for optimal performance. The loop titled
reciprocal_sqrt_4xloop works with 16 floating-point numbers in each iteration and is unrolled to
keep the processor busy by masking the latencies of the reciprocal and square-root instructions. In
general, unrolling loops improves performance by providing opportunities for the processor to work
on data pertaining to the next loop iteration while waiting for the result of an operation from the
previous iteration. The reciprocal_sqrt_1xloop loop performs the reciprocation and square root
on the remaining elements that do not form a full segment of 16 floating-point values. In this chapter,
the previous function is the only example that handles any vector stream of num_points size. This is
done to preserve space, but all examples in this chapter can be modified in a similar manner and used
universally.

Additionally, the previous SSE function makes use of the PREFETCHNTA instruction to reduce
cache latency. The unrolled loop reciprocal_sqrt_4xloop was chosen to work with 64 bytes of
data per iteration, which happens to be the size of one cache line (the term used to signify the
quantum of data brought into the processor’s cache by a memory access, if the data does not reside
there already). The prefetch requests the cache line located 256 bytes (four cache lines) ahead of the
current load address, that is, the operands that the loop will need four iterations from now. While the
processor works on the next three iterations, the data for the fourth iteration is sent to the processor.
The processor then does not have to wait for the data loaded by the aligned SSE instruction MOVAPS
to arrive from memory before performing operations on the fourth iteration. This type of memory
optimization can be very useful in gaming and high-performance computing, in which data sets are
unlikely to reside in the processor’s cache. For example, in a simulation involving a million vertices
or atoms in which the storage for their coordinates would require 12 bytes per vertex, the total space
for the data would be about 12 Mbytes.
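
For readers working in C, the structure of the preceding routine (a main loop over one cache line
of 16 floats, a prefetch four cache lines ahead, and a scalar loop for the remainder) can be sketched
with SSE intrinsics. This is a minimal sketch under stated assumptions, not the routine from this
guide: the name reciprocal_sqrt_c is hypothetical, unaligned loads and ordinary stores stand in for
MOVAPS and MOVNTPS so that no alignment is assumed, and an x86 compiler that provides
<xmmintrin.h> is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h> /* SSE intrinsics; assumes an x86 compiler */

/* Hypothetical C counterpart of the listing: out[i] ~= 1/sqrt(r[i]).
 * The reciprocal intrinsics are approximations, not correctly rounded. */
static void reciprocal_sqrt_c(const float *r, float *out, size_t n)
{
    size_t i = 0;

    /* Main loop: 16 floats (64 bytes, one cache line) per iteration. */
    for (; i + 16 <= n; i += 16) {
        /* Prefetch the line 4 cache lines (64 floats) ahead, if in range. */
        if (i + 64 < n)
            _mm_prefetch((const char *)(r + i + 64), _MM_HINT_NTA);

        for (size_t k = 0; k < 16; k += 4) {
            __m128 v = _mm_loadu_ps(r + i + k); /* [r3,r2,r1,r0]   */
            v = _mm_sqrt_ps(v);                 /* SQRTPS          */
            v = _mm_rcp_ps(v);                  /* RCPPS           */
            _mm_storeu_ps(out + i + k, v);
        }
    }

    /* Remainder loop: one float per iteration (SQRTSS, RCPSS). */
    for (; i < n; ++i) {
        __m128 v = _mm_load_ss(r + i);
        v = _mm_rcp_ss(_mm_sqrt_ss(v));
        _mm_store_ss(out + i, v);
    }
}
```

Because the reciprocal instructions return an approximation, a caller that needs full single
precision would have to refine the result or use a division instead.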
9.12 Use XOR Operations to Negate Operands of SSE, SSE2, and 3DNow!™ Instructions

Optimization 

For AMD Athlon, AMD Athlon 64, and AMD Opteron processors, use instructions that perform
XOR operations (PXOR, XORPS, and XORPD) instead of multiplication instructions to change the
sign bit of operands of SSE, SSE2, and 3DNow! instructions.

Application 

This optimization applies to: 

• 32-bit software 

• 64-bit software 

Rationale 

On the AMD Athlon 64 and AMD Opteron processors, using XOR-type instructions allows for more 
parallelism, as these instructions can execute in either the FADD or FMUL pipe of the floating-point 
unit. 

Single Precision 

For single-precision, you can use either 3DNow! or SSE SIMD XOR operations. The latency of
multiplying by -1.0 in 3DNow! is 4 cycles, while the latency of using the PXOR instruction is only
2 cycles. Similarly, the latency of the MULPS instruction is 5 cycles, while the latency of the XORPS
instruction is 3 cycles. The following code example illustrates how to toggle the sign bit of a number
using 3DNow! instructions:

signmask DQ 8000000080000000h

pxor mm0, [signmask] ; Toggle sign bits of both floats.

This example does the same thing using SSE instructions:

signmask DQ 8000000080000000h,8000000080000000h

xorps xmm0, [signmask] ; Toggle sign bits of all four floats.

Double Precision 

To perform double-precision arithmetic, you can use the XORPD instruction (similar to the
single-precision example) to flip the sign of packed double-precision floating-point operands. The
XORPD instruction takes 3 cycles to execute, whereas the MULPD instruction requires 5 cycles.

signmask DQ 8000000000000000h,8000000000000000h

xorpd xmm0, [signmask] ; Toggle sign bit of both doubles.
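
The same sign-bit toggle can be sketched in portable C for a single value. This is an illustrative
sketch only: negate_f32 and negate_f64 are hypothetical helper names, IEEE 754 float and double
formats are assumed, and whether a given compiler emits XORPS or XORPD for this code is not
guaranteed.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Toggle the sign bit of a float by XORing its bit pattern with 80000000h. */
static float negate_f32(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);       /* reinterpret the bits safely */
    bits ^= UINT32_C(0x80000000);         /* same mask as signmask above */
    memcpy(&x, &bits, sizeof x);
    return x;
}

/* Toggle the sign bit of a double with the mask 8000000000000000h. */
static double negate_f64(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);
    bits ^= UINT64_C(0x8000000000000000);
    memcpy(&x, &bits, sizeof x);
    return x;
}
```

The masks are the per-element values that the signmask constants above replicate across a
register.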
9.13 Clearing MMX™ and XMM Registers with XOR Instructions

Optimization 

Use instructions that perform XOR operations (PXOR, XORPS, and XORPD) to clear all the bits in 
MMX and XMM registers. 

Application 

This optimization applies to: 

• 32-bit software 

• 64-bit software 

Rationale 

The latency of the MMX XOR instruction (PXOR) is only 2 cycles and comparable to the 3 cycles
required to load data, assuming it is in the L1 data cache. The SSE and SSE2 XOR instructions
(XORPS and XORPD, respectively) have latencies of 3 cycles.

Examples 

The following examples illustrate how to clear the bits in a register using the different exclusive-OR 
instructions: 

; MMX

pxor mm0, mm0 ; Clear the MM0 register.

; SSE

xorps xmm0, xmm0 ; Clear the XMM0 register.

; SSE2

xorpd xmm0, xmm0 ; Clear the XMM0 register.

9.14 Finding the Floating-Point Absolute Value of Operands of SSE, SSE2, and 3DNow!™ Instructions

Optimization 

Use instructions that perform AND operations (PAND, ANDPS, and ANDPD) to determine the
absolute value of floating-point operands of SSE, SSE2, and 3DNow! instructions.

Application 

This optimization applies to: 

• 32-bit software 

• 64-bit software 

Rationale 

The MMX PAND instruction has a latency of 2 cycles, whereas the SSE and SSE2 AND instructions 
(ANDPS and ANDPD, respectively) have latencies of 3 cycles. The following examples illustrate 
how to clear the sign bits: 

; 3DNow!

absmask DQ 7FFFFFFF7FFFFFFFh

pand mm0, [absmask] ; Clear the sign bits of both floats in MM0.

; SSE

absmask DQ 7FFFFFFF7FFFFFFFh,7FFFFFFF7FFFFFFFh

andps xmm0, [absmask] ; Clear the sign bits of all four floats in XMM0.

; SSE2

absmask DQ 7FFFFFFFFFFFFFFFh,7FFFFFFFFFFFFFFFh

andpd xmm0, [absmask] ; Clear the sign bits of both doubles in XMM0.
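
For comparison, the sign-bit clear can be sketched in portable C for a single value. This is a
sketch under assumptions rather than code from this guide: abs_f32 and abs_f64 are hypothetical
helper names, and IEEE 754 float and double formats are assumed.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Clear the sign bit of a float by ANDing its bit pattern with 7FFFFFFFh. */
static float abs_f32(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);       /* reinterpret the bits safely */
    bits &= UINT32_C(0x7FFFFFFF);         /* same mask as absmask above  */
    memcpy(&x, &bits, sizeof x);
    return x;
}

/* Clear the sign bit of a double with the mask 7FFFFFFFFFFFFFFFh. */
static double abs_f64(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);
    bits &= UINT64_C(0x7FFFFFFFFFFFFFFF);
    memcpy(&x, &bits, sizeof x);
    return x;
}
```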
9.15 Accumulating Single-Precision Floating-Point Numbers Using SSE, SSE2, and 3DNow!™ Instructions

Optimization 

In 32-bit software, use the 3DNow! PFACC instruction to perform complex-number multiplication, 
4x4 matrix multiplication, and dot products. For 64-bit software, careful selection of SSE 
instructions based on how the data is organized can also lead to more efficient code, as shown in the 
second example. 

Application 

This optimization applies to: 

• 32-bit software 

• 64-bit software 

Rationale 

Though SSE, SSE2, and 3DNow! instructions are similar in the sense that they all have vectorized
multiplication and addition, 3DNow! technology supports certain special instructions. One of these is
the PFACC instruction. There are many instances where PFACC is useful, such as complex-number
multiplication, 4x4 matrix multiplication, and dot products.

Examples 

The following example accumulates two floats in two MMX registers: 

; accumulate_3dnow(float *a_and_b, float *c_and_d, float *aplusb_cplusd);
;
; TO ASSEMBLE INTO *.obj DO THE FOLLOWING:
; ml.exe -coff -c accumulate_3dnow.asm
;
.586
.K3D
.XMM

_TEXT SEGMENT
PUBLIC _accumulate_3dnow
_accumulate_3dnow PROC NEAR

; INSTRUCTIONS BELOW SAVE THE REGISTER STATE WITH WHICH THIS ROUTINE WAS ENTERED
; REGISTERS (EAX, ECX, EDX ARE CONSIDERED VOLATILE AND ASSUMED TO BE CHANGED)
; WHILE THE REGISTERS BELOW MUST BE PRESERVED IF THE USER IS CHANGING THEM
push ebp
mov  ebp, esp

; Parameters passed into routine:
; [ebp+8]  = ->a_and_b
; [ebp+12] = ->c_and_d
; [ebp+16] = ->aplusb_cplusd
push ebx
push esi
push edi

; THE 3 ASM LINES BELOW LOAD THE FUNCTION'S ARGUMENTS INTO GENERAL-PURPOSE
; REGISTERS (GPRS)
; esi = starting address of 2 floats "a_and_b"
; edi = starting address of 2 floats "c_and_d"
; eax = starting address of 2 floats "aplusb_cplusd"
mov  esi, [ebp+8]   ; esi = ->a_and_b
mov  edi, [ebp+12]  ; edi = ->c_and_d
mov  eax, [ebp+16]  ; eax = ->aplusb_cplusd

; ADD a AND b TOGETHER AND ALSO c AND d
emms
movq  mm0, [esi]    ; mm0 = [b,a]
movq  mm1, [edi]    ; mm1 = [d,c]
pfacc mm0, mm1      ; mm0 = [c+d,b+a]
movq  [eax], mm0    ; Store [c+d,b+a] to aplusb_cplusd.
emms                ; Clean up the MMX state before returning.

; INSTRUCTIONS BELOW RESTORE THE REGISTER STATE WITH WHICH THIS ROUTINE
; WAS ENTERED
; REGISTERS (EAX, ECX, EDX ARE CONSIDERED VOLATILE AND ASSUMED TO BE CHANGED)
; WHILE THE REGISTERS BELOW MUST BE PRESERVED IF THE USER IS CHANGING THEM
pop  edi
pop  esi
pop  ebx
mov  esp, ebp
pop  ebp

ret

_accumulate_3dnow ENDP
_TEXT ENDS
END

The same operation can be performed using SSE instructions, but the data in the XMM registers must
be rearranged. The next example takes four floating-point values in each of four XMM registers,
XMM4-XMM7, and then rearranges and adds the values so as to accumulate the sum of each XMM
register into a float in XMM1.

; The instructions below take the 4 floats in each XMM register below:
; xmm4 = [d,c,b,a]
; xmm5 = [D,C,B,A]
; xmm6 = [h,g,f,e]
; xmm7 = [H,G,F,E]
;
; and arrange them to look like:
; xmm4 = [E,e,A,a]
; xmm1 = [F,f,B,b]
; xmm3 = [G,g,C,c]
; xmm2 = [H,h,D,d]
movaps   xmm3, xmm4  ; xmm3 | [d,c,b,a]
movaps   xmm0, xmm5  ; xmm0 | [D,C,B,A]
unpcklps xmm4, xmm6  ; xmm4 | [f,b,e,a]
unpckhps xmm3, xmm6  ; xmm3 | [h,d,g,c]
movaps   xmm1, xmm4  ; xmm1 | [f,b,e,a]
movaps   xmm2, xmm3  ; xmm2 | [h,d,g,c]
unpcklps xmm5, xmm7  ; xmm5 | [F,B,E,A]
unpckhps xmm0, xmm7  ; xmm0 | [H,D,G,C]
unpcklps xmm4, xmm5  ; xmm4 | [E,e,A,a]
unpckhps xmm1, xmm5  ; xmm1 | [F,f,B,b]
unpcklps xmm3, xmm0  ; xmm3 | [G,g,C,c]
unpckhps xmm2, xmm0  ; xmm2 | [H,h,D,d]

; Now if we compute the sum of these registers, we get the sum of the
; elements originally in xmm4 (for example, the dot-product of the first
; row of a matrix A with a vector X, if xmm4 held their element-wise
; products):
; a+b+c+d
; in the lower DWORD of the resultant XMM register. The sum for the
; second register is stored in the second DWORD and so on, such that:
; xmm1 = [E+F+G+H,e+f+g+h,A+B+C+D,a+b+c+d]
addps xmm1, xmm4     ; xmm1 | [E+F,e+f,A+B,a+b]
addps xmm3, xmm2     ; xmm3 | [G+H,g+h,C+D,c+d]
addps xmm1, xmm3     ; xmm1 | [E+F+G+H,e+f+g+h,A+B+C+D,a+b+c+d]
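
The unpack-and-add sequence above can also be written with SSE intrinsics, which makes the data
movement easier to check in C. The following is a minimal sketch, not code from this guide: the
function name sum_rows4 and the flat 16-float layout are assumptions made for illustration, and an
x86 compiler that provides <xmmintrin.h> is assumed.

```c
#include <assert.h>
#include <xmmintrin.h> /* SSE intrinsics; assumes an x86 compiler */

/* m holds four rows of four floats; sums[k] receives the sum of row k. */
static void sum_rows4(const float m[16], float sums[4])
{
    __m128 v0 = _mm_loadu_ps(m);          /* [d,c,b,a] */
    __m128 v1 = _mm_loadu_ps(m + 4);      /* [D,C,B,A] */
    __m128 v2 = _mm_loadu_ps(m + 8);      /* [h,g,f,e] */
    __m128 v3 = _mm_loadu_ps(m + 12);     /* [H,G,F,E] */

    __m128 t0 = _mm_unpacklo_ps(v0, v2);  /* [f,b,e,a] */
    __m128 t1 = _mm_unpackhi_ps(v0, v2);  /* [h,d,g,c] */
    __m128 t2 = _mm_unpacklo_ps(v1, v3);  /* [F,B,E,A] */
    __m128 t3 = _mm_unpackhi_ps(v1, v3);  /* [H,D,G,C] */

    __m128 c0 = _mm_unpacklo_ps(t0, t2);  /* [E,e,A,a] */
    __m128 c1 = _mm_unpackhi_ps(t0, t2);  /* [F,f,B,b] */
    __m128 c2 = _mm_unpacklo_ps(t1, t3);  /* [G,g,C,c] */
    __m128 c3 = _mm_unpackhi_ps(t1, t3);  /* [H,h,D,d] */

    /* [E+F+G+H, e+f+g+h, A+B+C+D, a+b+c+d] */
    __m128 s = _mm_add_ps(_mm_add_ps(c0, c1), _mm_add_ps(c2, c3));
    _mm_storeu_ps(sums, s);
}
```

Note that element k of the result corresponds to the k-th input register in the order XMM4,
XMM5, XMM6, XMM7 of the assembly example.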
9.16 Complex-Number Arithmetic Using SSE, SSE2, and 3DNow!™ Instructions

Optimization 

Use vectorizing SSE, SSE2 and 3DNow! instructions to perform complex number calculations. 

Application 

This optimization applies to: 

• 32-bit software 

• 64-bit software 

Rationale 

Complex numbers have a “real” part and an “imaginary” part (where the imaginary part is denoted by
the letter i). For example, the complex number z1 might have a real part equal to 4 and an imaginary
part equal to 3, written as 4 + 3i. Multiplying and adding complex numbers is an integral part of
digital signal processing. Complex number addition is illustrated here using two complex numbers, z1
(4 + 3i) and z2 (5 + 2i):

z1 + z2 = (4 + 3i) + (5 + 2i) = [4 + 5] + [3 + 2]i = 9 + 5i

or:

sum.real = z1.real + z2.real
sum.imag = z1.imag + z2.imag

Complex number multiplication is illustrated here using the same two complex numbers:

z1 x z2 = (4 + 3i)(5 + 2i) = [4 x 5 - 3 x 2] + [3 x 5 + 4 x 2]i = 14 + 23i

or:

product.real = z1.real * z2.real - z1.imag * z2.imag
product.imag = z1.real * z2.imag + z1.imag * z2.real

Complex numbers are stored as streams of two-element vectors, the two elements being the real and
imaginary parts of the complex numbers. Addition of complex numbers can be achieved using
vectorizing SSE or 3DNow! instructions, such as PFADD, ADDPS, and ADDPD. Multiplication of
complex numbers is more involved.

From the formulas for multiplication, the real and imaginary parts of one of the numbers need to be
interchanged, and, additionally, the products must be positively or negatively accumulated depending
upon whether we are computing the imaginary or real portion of the product.

The following functions use SSE and 3DNow! instructions to illustrate complex multiplication of
streams of complex numbers x[] and y[], with the results stored in a product stream prod[]. For these
examples, assume that the sizes of x[] and y[] are multiples of four.
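
Before reading the vectorized listings, it may help to see the same computation as a scalar C
reference. This is a minimal sketch, not part of the original listings: the name cmplx_multiply_ref
is hypothetical, and the streams are assumed to hold interleaved (real, imaginary) float pairs, as
described above.

```c
#include <assert.h>
#include <stddef.h>

/* x, y, and prod each hold n complex numbers stored as interleaved
 * (real, imaginary) float pairs: element k occupies [2*k] and [2*k+1]. */
static void cmplx_multiply_ref(const float *x, const float *y,
                               size_t n, float *prod)
{
    for (size_t k = 0; k < n; ++k) {
        float xr = x[2 * k], xi = x[2 * k + 1];
        float yr = y[2 * k], yi = y[2 * k + 1];

        prod[2 * k]     = xr * yr - xi * yi;  /* product.real */
        prod[2 * k + 1] = xr * yi + xi * yr;  /* product.imag */
    }
}
```

A scalar routine like this is also a convenient way to check the output of a vectorized version on
the same input streams.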

Examples 

Listing 25. Complex Multiplication of Streams of Complex Numbers (SSE) 

; cmplx_multiply_sse(float *x, float *y, int num_cmplx_elem, float *prod);
;
; TO ASSEMBLE INTO *.obj DO THE FOLLOWING:
; ml.exe -coff -c cmplx_multiply_sse.asm
;
.586
.K3D
.XMM

_TEXT SEGMENT
PUBLIC _cmplx_multiply_sse
_cmplx_multiply_sse PROC NEAR

; INSTRUCTIONS BELOW SAVE THE REGISTER STATE WITH WHICH THIS ROUTINE WAS ENTERED
; REGISTERS (EAX, ECX, EDX ARE CONSIDERED VOLATILE AND ASSUMED TO BE CHANGED)
; WHILE THE REGISTERS BELOW MUST BE PRESERVED IF THE USER IS CHANGING THEM
push ebp
mov  ebp, esp

; Parameters passed into routine:
; [ebp+8]  = ->x
; [ebp+12] = ->y
; [ebp+16] = num_cmplx_elem
; [ebp+20] = ->prod
push ebx                      ; Preserve contents in ebx, esi, and edi on stack.
push esi
push edi

; THE CODE BELOW PUTS THE FLOATING POINT SIGN MASK
; [80000000000000008000000000000000h]
; TO FLIP THE SIGN OF PACKED SINGLE PRECISION NUMBERS BY USING XORPS
mov  eax, esp                 ; Copy stack pointer into EAX.
mov  ebx, 16
sub  esp, 32                  ; Subtract 32 bytes from stack pointer.
and  eax, 15                  ; AND old stack pointer address with 15 to
                              ; determine # of bytes the address is past a
                              ; 16-byte-aligned address.
sub  ebx, eax                 ; EBX = # of bytes above ESP to next
                              ; 16-byte-aligned address
mov  edi, 0h                  ; EDI = 00000000h
mov  esi, 80000000h           ; ESI = 80000000h
shr  ebx, 2                   ; EBX = # of DWORDs past 16-byte-aligned address
mov  [esp+4*ebx+12], esi      ; Move into address esp+4*ebx the single-precision
mov  [esp+4*ebx+8], edi       ; floating-point sign mask.
mov  [esp+4*ebx+4], esi
mov  [esp+4*ebx], edi

; THE 4 ASM LINES BELOW LOAD THE FUNCTION'S ARGUMENTS INTO GENERAL-PURPOSE
; REGISTERS (GPRS)
; esi = address of array "x"
; edi = address of array "y"
; ecx = # of cmplx products to compute
; eax = address of product to which results are stored
mov  esi, [ebp+8]             ; esi = ->x
mov  edi, [ebp+12]            ; edi = ->y
mov  ecx, [ebp+16]            ; ecx = num_cmplx_elem
mov  eax, [ebp+20]            ; eax = ->prod

; THE 6 ASM LINES BELOW OFFSET THE ADDRESS TO THE ARRAYS x[] AND y[] SUCH
; THAT THEY CAN BE ACCESSED IN THE MOST EFFICIENT MANNER AS ILLUSTRATED
; BELOW IN THE LOOP eight_cmplx_prod_loop WITH THE MINIMUM NUMBER OF
; ADDRESS INCREMENTS
mov  edx, ecx                 ; edx = num_cmplx_elem
neg  ecx                      ; ecx = -num_cmplx_elem
shl  edx, 3                   ; edx = 8 * num_cmplx_elem = # bytes in x[] and
                              ; y[] to multiply
add  esi, edx                 ; esi = -> to last element of x[] to multiply
add  edi, edx                 ; edi = -> to last element of y[] to multiply
add  eax, edx                 ; eax = -> end of prod[] to calculate

; THIS LOOP MULTIPLIES 4 COMPLEX #s FROM "x[]" UPON 4 COMPLEX #s FROM "y[]"
; AND RETURNS THE PRODUCT IN "prod[]".
ALIGN 16                      ; Align address of loop to a 16-byte boundary.
eight_cmplx_prod_loop:
movaps  xmm0, [esi+ecx*8]     ; xmm0=[x1i,x1r,x0i,x0r]
movaps  xmm1, [esi+ecx*8+16]  ; xmm1=[x3i,x3r,x2i,x2r]
movaps  xmm4, [edi+ecx*8]     ; xmm4=[y1i,y1r,y0i,y0r]
movaps  xmm5, [edi+ecx*8+16]  ; xmm5=[y3i,y3r,y2i,y2r]
movaps  xmm2, xmm0            ; xmm2=[x1i,x1r,x0i,x0r]
movaps  xmm3, xmm1            ; xmm3=[x3i,x3r,x2i,x2r]
movaps  xmm6, xmm4            ; xmm6=[y1i,y1r,y0i,y0r]
movaps  xmm7, xmm5            ; xmm7=[y3i,y3r,y2i,y2r]
shufps  xmm0, xmm0, 10100000b ; xmm0=[x1r,x1r,x0r,x0r]
shufps  xmm1, xmm1, 10100000b ; xmm1=[x3r,x3r,x2r,x2r]
shufps  xmm2, xmm2, 11110101b ; xmm2=[x1i,x1i,x0i,x0i]
shufps  xmm3, xmm3, 11110101b ; xmm3=[x3i,x3i,x2i,x2i]
xorps   xmm6, [esp+4*ebx]     ; xmm6=[-y1i,y1r,-y0i,y0r]
xorps   xmm7, [esp+4*ebx]     ; xmm7=[-y3i,y3r,-y2i,y2r]
mulps   xmm0, xmm4            ; xmm0=[x1r*y1i,x1r*y1r,x0r*y0i,x0r*y0r]
mulps   xmm1, xmm5            ; xmm1=[x3r*y3i,x3r*y3r,x2r*y2i,x2r*y2r]
shufps  xmm6, xmm6, 10110001b ; xmm6=[y1r,-y1i,y0r,-y0i]
shufps  xmm7, xmm7, 10110001b ; xmm7=[y3r,-y3i,y2r,-y2i]
mulps   xmm2, xmm6            ; xmm2=[x1i*y1r,-x1i*y1i,x0i*y0r,-x0i*y0i]
mulps   xmm3, xmm7            ; xmm3=[x3i*y3r,-x3i*y3i,x2i*y2r,-x2i*y2i]
addps   xmm0, xmm2            ; xmm0=[x1r*y1i+x1i*y1r,x1r*y1r-x1i*y1i,
                              ;       x0r*y0i+x0i*y0r,x0r*y0r-x0i*y0i]
addps   xmm1, xmm3            ; xmm1=[x3r*y3i+x3i*y3r,x3r*y3r-x3i*y3i,
                              ;       x2r*y2i+x2i*y2r,x2r*y2r-x2i*y2i]
movntps [eax+ecx*8], xmm0     ; Stream XMM0 and XMM1 to representative
movntps [eax+ecx*8+16], xmm1  ; memory address of prod[].
add     ecx, 4                ; ECX = ECX + 4
jnz     eight_cmplx_prod_loop

sfence                        ; Finish all memory writes.

; INSTRUCTIONS BELOW RESTORE THE REGISTER STATE WITH WHICH THIS ROUTINE WAS
; ENTERED
; REGISTERS EAX, ECX, AND EDX ARE CONSIDERED VOLATILE AND ASSUMED TO BE CHANGED
; WHILE THE REGISTERS BELOW MUST BE PRESERVED IF THE USER IS CHANGING THEM
add  esp, 32
pop  edi
pop  esi
pop  ebx
mov  esp, ebp
pop  ebp

ret

_cmplx_multiply_sse ENDP
_TEXT ENDS
END

Listing 26. Complex Multiplication of Streams of Complex Numbers (3DNow!™ Technology) 

; cmplx_multiply_3dnow(float *x, float *y, int num_cmplx_elem, float *prod); 

I 

; TO ASSEMBLE INTO *.obj DO THE FOLLOWING: 

; ml.exe -coff -c cmplx_multiply_3dnow.asm 

I 

.586 
. K3D 
. XMM 

_TEXT SEGMENT 

PUBLIC _cmplx_multiply_3dnow 

;cmplx_multiply_3dnow(float *x, float *y, int num_cmplx_elem, float *prod); 

I 

; TO ASSEMBLE INTO *.obj DO THE FOLLOWING: 

; ml.exe -coff -c cmplx_multiply_3dnow.asm 

.586 
. K3D 
.XMM 

_TEXT SEGMENT 

PUBLIC _cmplx_multiply_3dnow 
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cmplx_multiply_3dnow PROC NEAR 


INSTRUCTIONS BELOW SAVE THE REGISTER STATE WITH WHICH THIS ROUTINE WAS ENTERED 
REGISTERS EAX, ECX, EDX ARE CONSIDERED VOLATILE AND ASSUMED TO BE CHANGED 
WHILE THE REGISTERS BELOW MUST BE PRESERVED IF THE USER IS CHANGING THEM 
push ebp 
mov ebp, esp 


Parameters passed into routine: 
[ebp+8] = ->x 

[ebp+12] = ->y 
[ebp+16] = num_cmplx_elem 
[ebp+20] = ->prod 


push ebx 
push esi 
push edi 


THE 4 ASM LINES BELOW LOAD THE FUNCTION'S ARGUMENTS INTO GENERAL-PURPOSE 
REGISTERS (GPRS) 
esi = address of array "x" 
edi = address of array "y" 
ecx = # of cmplx products to compute 

eax = address of product to which results are stored 


mov 

esi, 

[ebp+8] 

; esi = ->x 

mov 

edi, 

[ebp+12] 

; edi = ->y 

mov 

ecx, 

[ebp+16] 

; ecx = num cmplx elem 

mov 

eax, 

[ebp+20] 

; eax = ->prod 


THE 6 ASM LINES BELOW OFFSET THE ADDRESS TO THE ARRAYS x[] AND y[] SUCH 
THAT THEY CAN BE ACCESSED IN THE MOST EFFICIENT MANNER AS ILLUSTRATED 
BELOW IN THE LOOP mult4cmplxnum_loop WITH THE MINIMUM NUMBER OF 
ADDRESS INCREMENTS 


mov edx, ecx 
neg ecx 
imul edx, 8 
add esi, edx 
add edi, edx 
add eax, edx 


edx = num_cmplx_elem] 
ecx = -num_cmplx_elem 

edx = 8 * num_cmplx_elem = # bytes in x[] and y[] to multiply 
esi = -> to last element of x[] to multiply 
edi = -> to last element of y[] to multiply 
eax = -> end of prod[] to calculate 


THIS LOOP MULTIPLIES 4 COMPLEX #s FROM "x[]" UPON 4 COMPLEX #s FROM "y[]" 
AND RETURNS THE PRODUCT IN "prod[]". 


ALIGN 16                                   ; Align address of loop to a 16-byte boundary.
four_cmplx_prod_loop:
movq    mm0, QWORD PTR [esi+ecx*8]         ; mm0=[x0i,x0r]
movq    mm1, QWORD PTR [esi+ecx*8+8]       ; mm1=[x1i,x1r]
movq    mm2, QWORD PTR [esi+ecx*8+16]      ; mm2=[x2i,x2r]
movq    mm3, QWORD PTR [esi+ecx*8+24]      ; mm3=[x3i,x3r]
pswapd  mm4, QWORD PTR [esi+ecx*8]         ; mm4=[x0r,x0i]
pswapd  mm5, QWORD PTR [esi+ecx*8+8]       ; mm5=[x1r,x1i]
pswapd  mm6, QWORD PTR [esi+ecx*8+16]      ; mm6=[x2r,x2i]
pswapd  mm7, QWORD PTR [esi+ecx*8+24]      ; mm7=[x3r,x3i]
pfmul   mm0, QWORD PTR [edi+ecx*8]         ; mm0=[x0i*y0i,x0r*y0r]
pfmul   mm1, QWORD PTR [edi+ecx*8+8]       ; mm1=[x1i*y1i,x1r*y1r]
pfmul   mm2, QWORD PTR [edi+ecx*8+16]      ; mm2=[x2i*y2i,x2r*y2r]
pfmul   mm3, QWORD PTR [edi+ecx*8+24]      ; mm3=[x3i*y3i,x3r*y3r]
pfmul   mm4, QWORD PTR [edi+ecx*8]         ; mm4=[x0r*y0i,x0i*y0r]
pfmul   mm5, QWORD PTR [edi+ecx*8+8]       ; mm5=[x1r*y1i,x1i*y1r]
pfmul   mm6, QWORD PTR [edi+ecx*8+16]      ; mm6=[x2r*y2i,x2i*y2r]
pfmul   mm7, QWORD PTR [edi+ecx*8+24]      ; mm7=[x3r*y3i,x3i*y3r]
pfpnacc mm0, mm4                           ; mm0=[x0r*y0i+x0i*y0r,x0r*y0r-x0i*y0i]
pfpnacc mm1, mm5                           ; mm1=[x1r*y1i+x1i*y1r,x1r*y1r-x1i*y1i]
pfpnacc mm2, mm6                           ; mm2=[x2r*y2i+x2i*y2r,x2r*y2r-x2i*y2i]
pfpnacc mm3, mm7                           ; mm3=[x3r*y3i+x3i*y3r,x3r*y3r-x3i*y3i]
movntq  [eax+ecx*8], mm0                   ; Stream MM0-MM3 to representative memory
movntq  [eax+ecx*8+8], mm1                 ; addresses of prod[]
movntq  [eax+ecx*8+16], mm2
movntq  [eax+ecx*8+24], mm3
add     ecx, 4                             ; ECX = ECX + 4
jnz     four_cmplx_prod_loop


sfence                                     ; Finish all memory writes.

; INSTRUCTIONS BELOW RESTORE THE REGISTER STATE WITH WHICH THIS ROUTINE WAS
; ENTERED
; REGISTERS EAX, ECX, EDX ARE CONSIDERED VOLATILE AND ASSUMED TO BE CHANGED
; WHILE THE REGISTERS BELOW MUST BE PRESERVED IF THE USER IS CHANGING THEM
femms


pop edi 
pop esi 
pop ebx 
mov esp, ebp 
pop ebp 


ret 

_cmplx_multiply_3dnow ENDP 

_TEXT ENDS 

END 

The illustrations above make use of many optimization techniques. First, the 3DNow! technology 
code utilizes the PSWAPD and PFPNACC instructions, whose operations are outlined below: 

; PSWAPD
; Suppose that MM0 contains two floats: r and i.
; INPUT:
;   MM0 = [i,r]
; OUTPUT:
;   MM1 = [r,i]

pswapd mm1, mm0                  ; MM1 = [r,i]

; Additionally, PSWAPD can be used with a 64-bit memory location. Suppose
; that EDI contains the address of two floats: r and i.
; INPUT:
;   [EDI:EDI+8] = [i,r]
; OUTPUT:
;   MM1 = [r,i]

pswapd mm1, [edi]                ; MM1 = [r,i]

; PFPNACC
; Suppose that MM0 contains two floats: r1 * r2 (the product of the real parts
; of 2 complex numbers) and i1 * i2 (the product of the imaginary parts
; of 2 complex numbers).
; Also suppose that MM1 contains two floats: r1 * i2 (the product of the real
; part of the first complex number and the imaginary part of the second
; complex number) and i1 * r2 (the product of the imaginary part of the
; first complex number and the real part of the second complex number).
; INPUTS:
;   MM0 = [i1*i2,r1*r2]
;   MM1 = [i1*r2,r1*i2]
; OUTPUT:
;   MM0 = [r1*i2+i1*r2,r1*r2-i1*i2]

pfpnacc mm0, mm1                 ; MM0 = [r1*i2+i1*r2,r1*r2-i1*i2]

; Additionally, PFPNACC can be used with a 64-bit memory location. Suppose
; that EDI contains the address of two floats: r1 * i2 (the product of the
; real part of the first complex number and the imaginary part of the
; second complex number) and i1 * r2 (the product of the imaginary part of
; the first complex number and the real part of the second complex number).
; INPUTS:
;   MM0 = [i1*i2,r1*r2]
;   [EDI:EDI+8] = [i1*r2,r1*i2]
; OUTPUT:
;   MM0 = [r1*i2+i1*r2,r1*r2-i1*i2]

pfpnacc mm0, [edi]               ; MM0 = [r1*i2+i1*r2,r1*r2-i1*i2]
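The register-level behavior outlined in the comments above can also be modeled in plain C. The following is a minimal sketch for illustration only, not part of the original listings: the `pair2f` type and the helper names `pswapd`, `pfmul`, `pfpnacc`, and `cmplx_mul` are hypothetical stand-ins for the corresponding instructions, with `lo` assumed to hold bits 31-0 and `hi` bits 63-32 of an MMX register.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of one MMX register holding two packed floats.
   [hi, lo] matches the bracket notation used in the listing comments. */
typedef struct { float lo, hi; } pair2f;

/* PSWAPD: swap the upper and lower doublewords. */
pair2f pswapd(pair2f src)
{
    pair2f r = { src.hi, src.lo };
    return r;
}

/* PFMUL: element-wise single-precision multiply. */
pair2f pfmul(pair2f a, pair2f b)
{
    pair2f r = { a.lo * b.lo, a.hi * b.hi };
    return r;
}

/* PFPNACC: low result = dest.lo - dest.hi (negative accumulate),
   high result = src.lo + src.hi (positive accumulate). */
pair2f pfpnacc(pair2f dest, pair2f src)
{
    pair2f r = { dest.lo - dest.hi, src.lo + src.hi };
    return r;
}

/* One complex product, following the same instruction order as the loop:
   x and y are stored as [imaginary, real], i.e. lo = real, hi = imaginary. */
pair2f cmplx_mul(pair2f x, pair2f y)
{
    pair2f direct = pfmul(x, y);           /* [xi*yi, xr*yr] */
    pair2f cross  = pfmul(pswapd(x), y);   /* [xr*yi, xi*yr] */
    return pfpnacc(direct, cross);         /* [xr*yi+xi*yr, xr*yr-xi*yi] */
}
```

For example, multiplying 1+2i by 3+4i with `cmplx_mul` yields a real part of -5 in `lo` and an imaginary part of 10 in `hi`.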

The PFPNACC instruction is specifically designed for use in complex arithmetic operations.

Additionally, four complex numbers are concurrently multiplied in the examples using SSE and
3DNow! instructions to break up register dependencies. Loads, multiplications, and additions do not
execute with zero delay, but have a latency associated with them. The following instructions:

movq    mm0, QWORD PTR [esi+ecx*8]   ; mm0 = [x0i,x0r]
pswapd  mm4, QWORD PTR [esi+ecx*8]   ; mm4 = [x0r,x0i]
pfmul   mm0, QWORD PTR [edi+ecx*8]   ; mm0 = [x0i*y0i,x0r*y0r]
pfmul   mm4, QWORD PTR [edi+ecx*8]   ; mm4 = [x0r*y0i,x0i*y0r]
pfpnacc mm0, mm4                     ; mm0 = [x0r*y0i+x0i*y0r,x0r*y0r-x0i*y0i]


are dependent upon one another. The move from memory (MOVQ) requires 2 cycles, PSWAPD also 
requires 2 cycles, the two PFMUL instructions require 6 cycles, and PFPNACC requires 6 cycles. 
The instruction flow through the processor is illustrated on a clock-cycle basis, as follows: 


Instruction  0     2     4     6     8     10    12    14

MOVQ         xxxxxx
PSWAPD       xxxxxx
PFMUL              xxxxxxxxxxxxxxxxxx
PFMUL              xxxxxxxxxxxxxxxxxx
PFPNACC                              xxxxxxxxxxxxxxxxxxx

and takes 15 cycles to finish. During these 15 cycles, the processor has the ability to perform 60
single-precision floating-point operations, of which it only performs six. The majority of the time is
spent waiting for previous instructions to terminate so that arguments to future instructions are
available. By unrolling the multiplication, working with four complex numbers per loop iteration, there
are enough instructions that are not dependent on previous or presently executing operations so that the
processor can mask the execution latency by keeping itself busy, as illustrated below:

Instruction 0 2 4 6 8 10 12 14 16 18 


MOVQ      xxxxxx
MOVQ      xxxxxx
MOVQ      xxxxxx
MOVQ      xxxxxx
PSWAPD    xxxxxx
PSWAPD    xxxxxx
PSWAPD    xxxxxx
PSWAPD    xxxxxx
PFMUL     xxxxxxxxxxxxxxxxxx
PFMUL     xxxxxxxxxxxxxxxxxx
PFMUL     xxxxxxxxxxxxxxxxxx
PFMUL     xxxxxxxxxxxxxxxxxx
PFMUL     xxxxxxxxxxxxxxxxxx
PFMUL     xxxxxxxxxxxxxxxxxx
PFMUL     xxxxxxxxxxxxxxxxxx
PFMUL     xxxxxxxxxxxxxxxxxx
PFPNACC   xxxxxxxxxxxxxxxxxxx
PFPNACC   xxxxxxxxxxxxxxxxxxx
PFPNACC   xxxxxxxxxxxxxxxxxxx
PFPNACC   xxxxxxxxxxxxxxxxxxx


Multiplying four complex single-precision numbers only takes 17 cycles as opposed to 15 cycles to
multiply one complex single-precision number. The floating-point pipes are kept busy by feeding new 
instructions into the floating-point pipeline each cycle. In the arrangement above, 24 floating-point 
operations are performed in 17 cycles, achieving more than a 3.5x increase in performance. 
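As a check on the figure quoted above, the throughput ratio can be worked out directly. This assumes the 15-cycle count given earlier for the single-number sequence (6 floating-point operations) and the 17-cycle count for the unrolled sequence (24 floating-point operations):

```latex
\frac{24/17}{6/15} = \frac{24 \times 15}{17 \times 6} = \frac{360}{102} \approx 3.53
```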

The last optimization in both implementations is the use of the MOVNTQ and MOVNTPS 
instructions, nontemporal writes to memory that stream data to main memory. These instructions 
increase throughput to memory and make more efficient use of the bandwidth provided by the 
processor and memory controller. Nontemporal writes, such as MOVNTQ, MOVNTPS, and 
MOVNTDQ, should only be used on data that is not going to be accessed again in the near future.

9.17 Optimized 4x4 Matrix Multiplication on 4 x 1 Column Vector Routines

Optimization 

Transpose the rotation matrix to eliminate the need to accumulate floating-point values in an XMM 
register. 

Application 

This optimization applies to: 

• 32-bit software 

• 64-bit software 

Rationale 

The multiplication of a 4 x 4 matrix with a 4 x 1 vector is commonly used in 3-D graphics for 
geometric transformation (translating, scaling, rotating, and applying perspective to 3-D points 
represented in homogeneous coordinates). Efficiency in single-precision matrix multiplication can be 
enhanced by use of SIMD instructions to increase throughput, but there are other general 
optimizations that can be implemented to further increase performance. The first optimization is the 
transposition of the rotation matrix such that the column n of the matrix becomes the row n and the 
row m becomes the column m. This optimization does not benefit 3DNow! technology code (3DNow!
technology has extended instructions that preclude the need for this optimization), but does benefit 
SSE code. There are no SSE or SSE2 instructions that accumulate the floats and doubles in a single 
XMM register; for this reason, the matrix must be transposed. If the rotation matrix is not transposed, 
then the dot-product of a row of the matrix with a column vector necessitates the accumulation of the 
four floating-point values in an XMM register. The multiplication upon the column vector is 
illustrated here: 





                | r00 r01 r02 r03 |       | r00 r10 r20 r30 |   | v0 |   | v'0 |
tr(R) x v = tr  | r10 r11 r12 r13 | x v = | r01 r11 r21 r31 | x | v1 | = | v'1 |
                | r20 r21 r22 r23 |       | r02 r12 r22 r32 |   | v2 |   | v'2 |
                | r30 r31 r32 r33 |       | r03 r13 r23 r33 |   | v3 |   | v'3 |

               Step 0         Step 1         Step 2         Step 3

| v'0 |   | r00 x v0 |   | r01 x v1 |   | r02 x v2 |   | r03 x v3 |
| v'1 | = | r10 x v0 | + | r11 x v1 | + | r12 x v2 | + | r13 x v3 |
| v'2 |   | r20 x v0 |   | r21 x v1 |   | r22 x v2 |   | r23 x v3 |
| v'3 |   | r30 x v0 |   | r31 x v1 |   | r32 x v2 |   | r33 x v3 |





In each step above, the elements of the rotation matrix can be loaded into an XMM register with the 
MOVAPS instruction, assuming the rotation matrix begins at a 16-byte-aligned memory location. 
Transposition of the rotation matrix eliminates the need to accumulate the floating-point values in an
XMM register, but it does require the duplication of the elements of the 4x1 column vector V in all
four floating-point values of the XMM register in each step above. Listing 27 is an SSE function that 
performs 4x4 matrix multiplication upon a stream of num_vertices_to_rotate vertices. 
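The four steps above can be sketched in scalar C before reading the SSE listing. This is an illustrative reference under stated assumptions, not the guide's implementation: the function name `matrix_x_vector_ref` is hypothetical, `trR` is assumed to hold the transposed matrix so that each group of four consecutive floats is one column of R, and the input and output arrays are assumed not to overlap.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar reference for the transposed-matrix scheme.
   trR[4*k + lane] is assumed to hold R[lane][k], so trR[4*k .. 4*k+3]
   is column k of R, the same data that one MOVAPS would load. */
void matrix_x_vector_ref(const float *trR, const float *v,
                         int num_vertices_to_rotate, float *rotv)
{
    assert(trR != NULL && v != NULL && rotv != NULL);
    for (int n = 0; n < num_vertices_to_rotate; ++n) {
        const float *in = v + 4 * n;    /* [v0, v1, v2, v3] */
        float *out = rotv + 4 * n;      /* assumed not to alias v */
        for (int lane = 0; lane < 4; ++lane) {
            /* Each step multiplies one column of R by one duplicated
               element of the vector; no horizontal add is needed. */
            float step0 = trR[0 + lane]  * in[0];
            float step1 = trR[4 + lane]  * in[1];
            float step2 = trR[8 + lane]  * in[2];
            float step3 = trR[12 + lane] * in[3];
            /* Same pairing as the listing: (0+1) + (2+3). */
            out[lane] = (step0 + step1) + (step2 + step3);
        }
    }
}
```

Each `lane` here corresponds to one of the four floats in an XMM register, so the inner loop body is what a single packed multiply or add performs on all four lanes at once.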

Examples 

Listing 27. 4x4 Matrix Multiplication (SSE) 

; matrix_x_vector_sse(float *trR, float *v, int num_vertices_to_rotate,
;                     float *rotv);
;
; TO ASSEMBLE INTO *.obj DO THE FOLLOWING:
;   ml.exe -coff -c matrix_x_vector_sse.asm
;
.586
.K3D
.XMM

_TEXT SEGMENT
PUBLIC _matrix_x_vector_sse
_matrix_x_vector_sse PROC NEAR

; INSTRUCTIONS BELOW SAVE THE REGISTER STATE WITH WHICH THIS ROUTINE WAS
; ENTERED.
; REGISTERS EAX, ECX, AND EDX ARE CONSIDERED VOLATILE AND ASSUMED TO BE CHANGED,
; WHILE THE REGISTERS BELOW MUST BE PRESERVED IF THE USER IS CHANGING THEM
push ebp
mov ebp, esp

; Parameters passed into routine:
; [ebp+8]  = ->trR
; [ebp+12] = ->v
; [ebp+16] = num_vertices_to_rotate
; [ebp+20] = ->rotv

push ebx
push esi
push edi


; THE 4 ASM LINES BELOW LOAD THE FUNCTION'S ARGUMENTS INTO GENERAL-PURPOSE
; REGISTERS (GPRS)
; esi = address of Transposed Rotation Matrix
; edi = address of vertices to rotate
; ecx = # of vertices to rotate
; eax = address of rotated vertices

mov  esi, [ebp+8]      ; ESI = ->trR
mov  edi, [ebp+12]     ; EDI = ->v
mov  ecx, [ebp+16]     ; ECX = num_vertices_to_rotate
mov  edx, ecx          ; EDX = num_vertices_to_rotate
shl  edx, 4            ; EDX = 16*num_vertices_to_rotate
mov  eax, [ebp+20]     ; EAX = ->rotv
imul ecx, 2            ; ECX = # quadwords of vertices to rotate
add  edi, edx          ; EDI = -> end of "v"
add  eax, edx          ; EAX = -> end of "rotv"
neg  ecx               ; ECX = -# quadwords of vertices to rotate

; THE 4 ASM LINES BELOW LOAD THE TRANSPOSED ROTATION MATRIX "R" INTO XMM0-XMM3
; IN THE FOLLOWING MANNER:
; xmm0 = column 0 of "R" or row 0 of "R" transpose
; xmm1 = column 1 of "R" or row 1 of "R" transpose
; xmm2 = column 2 of "R" or row 2 of "R" transpose
; xmm3 = column 3 of "R" or row 3 of "R" transpose

movaps xmm0, [esi]     ; XMM0 = [R30,R20,R10,R00]
movaps xmm1, [esi+16]  ; XMM1 = [R31,R21,R11,R01]
movaps xmm2, [esi+32]  ; XMM2 = [R32,R22,R12,R02]
movaps xmm3, [esi+48]  ; XMM3 = [R33,R23,R13,R03]


; THIS LOOP ROTATES "num_vertices_to_rotate" VERTICES BY THE TRANSPOSED
; ROTATION MATRIX "R" PASSED INTO THE ROUTINE AND STORES THE ROTATED
; VERTICES TO "rotv".

ALIGN 16                       ; Align address of loop to a 16-byte boundary.
rotate_vertices_loop:
movlps   xmm4, [edi+8*ecx]     ; XMM4=[,,v1,v0]
movlps   xmm6, [edi+8*ecx+8]   ; XMM6=[,,v3,v2]
unpcklps xmm4, xmm4            ; XMM4=[v1,v1,v0,v0]
unpcklps xmm6, xmm6            ; XMM6=[v3,v3,v2,v2]
movhlps  xmm5, xmm4            ; XMM5=[,,v1,v1]
movhlps  xmm7, xmm6            ; XMM7=[,,v3,v3]
movlhps  xmm4, xmm4            ; XMM4=[v0,v0,v0,v0]
mulps    xmm4, xmm0            ; XMM4=[R30*v0,R20*v0,R10*v0,R00*v0]
movlhps  xmm5, xmm5            ; XMM5=[v1,v1,v1,v1]
mulps    xmm5, xmm1            ; XMM5=[R31*v1,R21*v1,R11*v1,R01*v1]
movlhps  xmm6, xmm6            ; XMM6=[v2,v2,v2,v2]
mulps    xmm6, xmm2            ; XMM6=[R32*v2,R22*v2,R12*v2,R02*v2]
addps    xmm4, xmm5            ; XMM4=[R30*v0+R31*v1,R20*v0+R21*v1,
                               ;       R10*v0+R11*v1,R00*v0+R01*v1]
movlhps  xmm7, xmm7            ; XMM7=[v3,v3,v3,v3]
mulps    xmm7, xmm3            ; XMM7=[R33*v3,R23*v3,R13*v3,R03*v3]
addps    xmm6, xmm7            ; XMM6=[R32*v2+R33*v3,R22*v2+R23*v3,
                               ;       R12*v2+R13*v3,R02*v2+R03*v3]
addps    xmm4, xmm6            ; XMM4=New rotated vertex
movntps  [eax+8*ecx], xmm4     ; Store rotated vertex to rotv.
add      ecx, 2                ; Decrement the # of QWORDs to rotate by 2
jnz      rotate_vertices_loop

sfence                         ; Finish all memory writes.



; INSTRUCTIONS BELOW RESTORE THE REGISTER STATE WITH WHICH THIS ROUTINE
; WAS ENTERED
; REGISTERS EAX, ECX, EDX ARE CONSIDERED VOLATILE AND ASSUMED TO BE CHANGED
; WHILE THE REGISTERS BELOW MUST BE PRESERVED IF THE USER IS CHANGING THEM
pop edi 
pop esi 
pop ebx 
mov esp, ebp 
pop ebp 


ret 

_matrix_x_vector_sse ENDP 

_TEXT ENDS 

END 

To greatly enhance performance, the previous function can perform the matrix multiplication not only 
upon one four-column vector, but upon many. Creating a separate function to transform a single 
vertex and repeatedly calling the function is prohibitively expensive because of the overhead in 
pushing and popping registers from the stack. This applies to routines that negate a single vector, 
nullify a single vector, and add two vectors. Listing 28 is the 3DNow! technology counterpart to 
Listing 27 on page 231. 

Listing 28. 4x4 Matrix Multiplication (3DNow!™ Technology) 

; matrix_x_vector_3dnow(float *trR, float *v, int num_vertices_to_rotate,
;                       float *rotv);
;
; TO ASSEMBLE INTO *.obj DO THE FOLLOWING:
;   ml.exe -coff -c matrix_x_vector_3dnow.asm
;
.586
.K3D
.XMM

_TEXT SEGMENT
PUBLIC _matrix_x_vector_3dnow
_matrix_x_vector_3dnow PROC NEAR

; INSTRUCTIONS BELOW SAVE THE REGISTER STATE WITH WHICH THIS ROUTINE WAS
; ENTERED.
; REGISTERS EAX, ECX, AND EDX ARE CONSIDERED VOLATILE AND ASSUMED TO BE CHANGED,
; WHILE THE REGISTERS BELOW MUST BE PRESERVED IF THE USER IS CHANGING THEM
push ebp
mov ebp, esp

; Parameters passed into routine:
; [ebp+8]  = ->trR
; [ebp+12] = ->v
; [ebp+16] = num_vertices_to_rotate
; [ebp+20] = ->rotv

push ebx
push esi
push edi

; THE 4 ASM LINES BELOW LOAD THE FUNCTION'S ARGUMENTS INTO GENERAL-PURPOSE
; REGISTERS (GPRs)
; eax = address of Transposed Rotation Matrix
; edx = address of vertices to rotate
; ecx = # of vertices to rotate
; ebx = address of rotated vertices

mov eax, [ebp+8]       ; EAX = ->trR
mov edx, [ebp+12]      ; EDX = ->v
mov ecx, [ebp+16]      ; ECX = num_vertices_to_rotate
mov ebx, [ebp+20]      ; EBX = ->rotv


femms                  ; Clear MMX state.

ALIGN 16               ; Ensure optimal branch alignment.

; THIS LOOP ROTATES "num_vertices_to_rotate" VERTICES BY THE TRANSPOSED
; ROTATION MATRIX "R" PASSED INTO THE ROUTINE AND STORES THE ROTATED
; VERTICES TO "rotv".

rotate_vertices_loop:
add       ebx, 16        ; Increment ->rotv to next transformed vertex.
movq      mm0, [edx]     ; MM0 = [y,x]
movq      mm1, [edx+8]   ; MM1 = [w,z]
add       edx, 16        ; Increment ->v to next vertex.
movq      mm2, mm0       ; MM2 = [y,x]
movq      mm3, [eax]     ; MM3 = [R01,R00]
punpckldq mm0, mm0       ; MM0 = [x,x]
movq      mm4, [eax+16]  ; MM4 = [R11,R10]
pfmul     mm3, mm0       ; MM3 = [x*R01,x*R00]
punpckhdq mm2, mm2       ; MM2 = [y,y]
pfmul     mm4, mm2       ; MM4 = [y*R11,y*R10]
movq      mm5, [eax+8]   ; MM5 = [R03,R02]
movq      mm7, [eax+24]  ; MM7 = [R13,R12]
movq      mm6, mm1       ; MM6 = [w,z]
pfmul     mm5, mm0       ; MM5 = [x*R03,x*R02]
movq      mm0, [eax+32]  ; MM0 = [R21,R20]
punpckldq mm1, mm1       ; MM1 = [z,z]
pfmul     mm7, mm2       ; MM7 = [y*R13,y*R12]
movq      mm2, [eax+40]  ; MM2 = [R23,R22]
pfmul     mm0, mm1       ; MM0 = [z*R21,z*R20]
pfadd     mm3, mm4       ; MM3 = [x*R01+y*R11,x*R00+y*R10]
movq      mm4, [eax+48]  ; MM4 = [R31,R30]
pfmul     mm2, mm1       ; MM2 = [z*R23,z*R22]
pfadd     mm5, mm7       ; MM5 = [x*R03+y*R13,x*R02+y*R12]
movq      mm1, [eax+56]  ; MM1 = [R33,R32]
punpckhdq mm6, mm6       ; MM6 = [w,w]
pfadd     mm3, mm0       ; MM3 = [x*R01+y*R11+z*R21,x*R00+y*R10+z*R20]
pfmul     mm4, mm6       ; MM4 = [w*R31,w*R30]
pfmul     mm1, mm6       ; MM1 = [w*R33,w*R32]
pfadd     mm5, mm2       ; MM5 = [x*R03+y*R13+z*R23,x*R02+y*R12+z*R22]
pfadd     mm3, mm4       ; MM3 = [x*R01+y*R11+z*R21+w*R31,
                         ;        x*R00+y*R10+z*R20+w*R30]
movntq    [ebx-16], mm3  ; Store lower quadword of transformed vertex.
pfadd     mm5, mm1       ; MM5 = [x*R03+y*R13+z*R23+w*R33,
                         ;        x*R02+y*R12+z*R22+w*R32]
movntq    [ebx-8], mm5   ; Store upper QWORD of transformed vertex.
dec       ecx            ; Decrement # of vertices to transform.
jnz       rotate_vertices_loop

femms                    ; Clear MMX state.
sfence                   ; Finish all memory writes.


INSTRUCTIONS BELOW RESTORE THE REGISTER STATE WITH WHICH THIS ROUTINE 
WAS ENTERED. 

REGISTERS EAX, ECX, EDX ARE CONSIDERED VOLATILE AND ASSUMED TO BE CHANGED 
WHILE THE REGISTERS BELOW MUST BE PRESERVED IF THE USER IS CHANGING THEM 
pop edi 
pop esi 
pop ebx 
mov esp, ebp 
pop ebp 


ret 

_matrix_x_vector_3dnow ENDP 

_TEXT ENDS 

END

Chapter 10 x87 Floating-Point Optimizations


AMD Athlon™ 64 and AMD Opteron™ processors support multiple methods of performing 
floating-point operations. They support the older x87 assembly instructions in addition to the more 
recent SIMD instructions (SSE, SSE2, and 3DNow!™ technologies). Many of the suggestions in this 
chapter are also generally applicable to the AMD Athlon 64 and AMD Opteron processors, with the 
exception of SSE2 optimizations and expanded register usage. 

AMD Athlon 64 and AMD Opteron processors are 64-bit processors that are fully backwards 
compatible with 32-bit code. In general, 64-bit operating systems support the x87 and 3DNow! 
instructions in 32-bit threads; however, 64-bit operating systems may not support x87 and 3DNow! 
instructions in 64-bit threads. To make it easier to later migrate from 32-bit to 64-bit code, you may 
want to avoid x87 and 3DNow! instructions altogether and use only SSE and SSE2 instructions when 
writing new 32-bit code. 

This chapter details the methods used to optimize floating-point code to the pipelined x87
floating-point registers.

This chapter covers the following topics:

Topic                                                          Page
Using Multiplication Rather Than Division                      238
Achieving Two Floating-Point Operations per Clock Cycle        239
Floating-Point Compare Instructions                            244
Using the FXCH Instruction Rather Than FST/FLD Pairs           245
Floating-Point Subexpression Elimination                       246
Accumulating Precision-Sensitive Quantities in x87 Registers   247
Avoiding Extended-Precision Data                               248

10.1 Using Multiplication Rather Than Division

Optimization 

If accuracy requirements allow, convert floating-point division by a constant to multiplication by the 
reciprocal. 

Application 

This optimization applies to: 

• 32-bit software 

• 64-bit software 

Rationale 

Divisors that are powers of two—and their reciprocals—are exactly representable, and therefore do 
not cause an accuracy issue, except for the rare case in which the reciprocal overflows or underflows. 
Unless such an overflow or underflow occurs, always convert a division by a power of two to a
multiplication. Although the AMD Athlon 64 and AMD Opteron processors have high-performance
division, multiplication is significantly faster than division. 
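As a minimal sketch of this conversion in C (the function names `scale_by_division` and `scale_by_reciprocal` are hypothetical and chosen only for illustration), the reciprocal is computed once outside the loop and the per-element division becomes a multiplication. For a power-of-two divisor such as 8.0 the two routines are expected to give identical results, barring overflow or underflow; for other divisors the last bit may differ, which is why the optimization is conditional on accuracy requirements.

```c
#include <assert.h>
#include <stddef.h>

/* Baseline: one floating-point division per element. */
void scale_by_division(double *data, size_t n, double divisor)
{
    assert(data != NULL && divisor != 0.0);
    for (size_t i = 0; i < n; ++i)
        data[i] /= divisor;
}

/* Converted form: one division up front, then one multiplication
   per element. Exact when the divisor is a power of two and the
   reciprocal neither overflows nor underflows. */
void scale_by_reciprocal(double *data, size_t n, double divisor)
{
    assert(data != NULL && divisor != 0.0);
    const double reciprocal = 1.0 / divisor;
    for (size_t i = 0; i < n; ++i)
        data[i] *= reciprocal;
}
```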
10.2 Achieving Two Floating-Point Operations per Clock Cycle

Optimization 

Pay special attention to the order and packing of the operations to sustain up to two floating-point 
operations per clock cycle. 

Application 

This optimization applies to: 

• 32-bit software 

• 64-bit software 

Rationale 

The floating-point unit in the AMD Athlon, AMD Athlon 64, and AMD Opteron processors can 
sustain up to two floating-point operations per clock cycle. However, to achieve this, you must pay 
special attention to the order and packing of the operations. For example, consider multiplying a 
30 x 30 double-precision matrix A by a transposed 30 x 30 double-precision matrix B, the result of 
which is called C. 

Use Efficient Addressing of FPU Data When Loading and Storing 

The rows of A are 240 bytes wide, as are the columns of B. There are eight x87 floating-point 
registers [ST(0)-ST(7)], and in this example, six rows of A are concurrently multiplied by a single 
column of B. The address of the first element of the first row of A (A[0]) is presumed to be stored in 
the EDI register, while the address of the first element of the first column of B (B[0]) is stored in ESI. 

This addressing scheme might seem like a good idea, but it is not. Only 128 bytes can be addressed
forward of A[0] with 8-bit offsets, meaning the instructions that use them are only 3 bytes in size
(2 bytes for the instruction and 1 byte for the offset to the address stored in the general-purpose
register). Upon offsetting more than 128 bytes from the address in the general-purpose register, the
size of the instruction increases from 3 bytes to 6 bytes (offsets larger than 128 bytes are
represented by 32 bits rather than 8 bits). Large instruction sizes reduce the number of decoded
operations to be executed within the pipes of the floating-point unit, and as such prevent us from
achieving two floating-point operations per clock cycle. To alleviate this, the general-purpose
registers EDI and ESI are offset by 128 bytes such that they contain the addresses of A[16] and B[16].
This is beneficial because data within 128 bytes (16 double-precision numbers) before or after these
two locations can now be accessed with instructions that are 2-3 bytes in size. The next five rows of
A can be efficiently addressed in terms of the first row. Storing the size of a single row of A
(240 bytes) in the EAX register, the size of three rows (720 bytes) in EBX, and the size of five rows
(1200 bytes) in ECX, the first element of rows 0-5 of A can be addressed as follows:


fld QWORD PTR [edi-128]           ; Load A[i,0].
fld QWORD PTR [edi+eax-128]       ; Load A[i+1,0].
fld QWORD PTR [edi+eax*2-128]     ; Load A[i+2,0].
fld QWORD PTR [edi+ebx-128]       ; Load A[i+3,0].
fld QWORD PTR [edi+eax*4-128]     ; Load A[i+4,0].
fld QWORD PTR [edi+ecx-128]       ; Load A[i+5,0].


This addressing scheme reduces the size of all loads from memory to 3 bytes; additionally, to address 
rows 6-11 of A, you only need to add 240*6 to EDI. 
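The effect of biasing the base registers can be checked with a few lines of C. This is a sketch under stated assumptions rather than anything from the guide: the helper names `fits_disp8` and `reachable_doubles` are hypothetical, and the only fact relied on is that an 8-bit displacement is a signed byte covering -128 through +127.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if the displacement can be encoded in a signed 8-bit field. */
int fits_disp8(long disp)
{
    return disp >= INT8_MIN && disp <= INT8_MAX;
}

/* Counts how many 8-byte elements of a row can be reached with an 8-bit
   displacement when the base register points bias_bytes past element 0. */
int reachable_doubles(long bias_bytes, int row_elems)
{
    assert(row_elems >= 0);
    int count = 0;
    for (int i = 0; i < row_elems; ++i) {
        long disp = (long)i * 8 - bias_bytes;  /* offset of element i from base */
        if (fits_disp8(disp))
            ++count;
    }
    return count;
}
```

With no bias, only the first 16 elements of a 30-element row fall within reach of the short encoding; with a 128-byte bias, all 30 do, since the displacements then run from -128 to +104.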


Avoid Register Dependencies by Spacing Apart Instructions that Accumulate Results 
in a Register 

The second general optimization to consider is spacing out register dependencies. Operations 
internally in the floating-point unit have an execution latency (normally 3-4 cycles for x87 
operations). Consider this instruction sequence: 


fldz                              ; Push 0.0 onto floating-point stack.
fld  QWORD PTR [edi-128]          ; Push A[i,0] onto stack.
fmul QWORD PTR [esi-128]          ; Multiply A[i,0] by B[0,j].
faddp st(1), st(0)                ; Accumulate contribution to dot product of
                                  ; A's row i and B's column j.
fld  QWORD PTR [edi-120]          ; Push A[i,1] onto stack.
fmul QWORD PTR [esi-120]          ; Multiply A[i,1] by B[1,j].
faddp st(1), st(0)                ; Accumulate contribution to dot product of
                                  ; A's row i and B's column j.
fld  QWORD PTR [edi-112]          ; Push A[i,2] onto stack.
fmul QWORD PTR [esi-112]          ; Multiply A[i,2] by B[2,j].
faddp st(1), st(0)                ; Accumulate contribution to dot product of
                                  ; A's row i and B's column j.


The second statement loads A[0] into ST(0), and the third statement multiplies it by B[0]. The 
subsequent line adds this product to ST(1), where the dot product of row 0 of matrix A and column 0 
of matrix B is accumulated. Each of the subsequent groups of three instructions adds the contribution 
of the remaining 29 elements to the dot product. This code is poor because all the addition operations 
depend upon the contents of a single register, ST(1). The AMD Athlon, AMD Athlon 64 and 
AMD Opteron processors have out-of-order-execution floating-point units, but none of the addition 
operations can be performed out of order because the result of each addition operation depends on the 
outcome of the previous addition operation. Instruction scheduling based on this code greatly limits 
the throughput of the floating-point unit. To alleviate this, space out operations that are dependent on 
one another. In this case, work with six rows of A rather than one at a time, as follows: 


; Multiply first element of each of six rows of A by first element of
; B's column j.

fldz                              ; Push 0.0 six times onto floating-point stack.
fldz
fldz
fldz
fldz
fldz
fld  QWORD PTR [esi-128]          ; Push B[0,j] onto stack.

fld  QWORD PTR [edi-128]          ; Push A[i,0] onto stack.
fmul st(0), st(1)                 ; Multiply A[i,0] by B[0,j].
faddp st(7), st(0)                ; Accumulate contribution to dot product of
                                  ; A's row i and B's column j.

fld  QWORD PTR [edi+eax-128]      ; Push A[i+1,0] onto stack.
fmul st(0), st(1)                 ; Multiply A[i+1,0] by B[0,j].
faddp st(6), st(0)                ; Accumulate contribution to dot product of
                                  ; A's row i+1 and B's column j.

fld  QWORD PTR [edi+eax*2-128]    ; Push A[i+2,0] onto stack.
fmul st(0), st(1)                 ; Multiply A[i+2,0] by B[0,j].
faddp st(5), st(0)                ; Accumulate contribution to dot product of
                                  ; A's row i+2 and B's column j.

fld  QWORD PTR [edi+ebx-128]      ; Push A[i+3,0] onto stack.
fmul st(0), st(1)                 ; Multiply A[i+3,0] by B[0,j].
faddp st(4), st(0)                ; Accumulate contribution to dot product of
                                  ; A's row i+3 and B's column j.

fld  QWORD PTR [edi+eax*4-128]    ; Push A[i+4,0] onto stack.
fmul st(0), st(1)                 ; Multiply A[i+4,0] by B[0,j].
faddp st(3), st(0)                ; Accumulate contribution to dot product of
                                  ; A's row i+4 and B's column j.

fmul QWORD PTR [edi+ecx-128]      ; Multiply A[i+5,0] by B[0,j].
faddp st(1), st(0)                ; Accumulate contribution to dot product of
                                  ; A's row i+5 and B's column j.


The processor can execute the instructions in this code sequence out of order because the instructions 
are independent. Even though the loads and multiplies are performed sequentially, the floating-point 
scheduler can execute the FLD and FMUL instructions out of order in addition to the FADD 
instruction so as to keep the multiplier and adder pipes of the floating-point unit busy. B[0] is initially 
loaded into an x87 register and multiplied by the loaded elements of each row with the reg, reg 
form of FMUL to minimize the number of load operations that need to be performed. Additionally, 
the first element from the sixth row of A is not loaded but simply multiplied from memory by the 
loaded element of B[0]. This eliminates an FLD instruction and decreases the number of instructions 
in the instruction cache and the workload on the processor’s decoder. To achieve two floating-point 
operations per clock cycle, the number of floating-point operations should be twice the number of 
load-store operations. In the example above, there are 12 floating-point operations and seven 
operations requiring loads from memory, so nearly two floating-point operations can be performed 
per clock cycle. 
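The same restructuring can be expressed at the C level. The sketch below is illustrative only; the function names `dot_single_chain` and `dot_six_rows` and the `row_stride` parameter (row length in elements) are assumptions introduced here. The first routine funnels every addition through one accumulator, as in the poor sequence above, while the second keeps six independent accumulators and reuses each loaded element of B across all six rows.

```c
#include <assert.h>
#include <stddef.h>

/* One dependency chain: every addition waits on the previous one. */
double dot_single_chain(const double *a, const double *b, size_t n)
{
    assert(a != NULL && b != NULL);
    double sum = 0.0;
    for (size_t k = 0; k < n; ++k)
        sum += a[k] * b[k];
    return sum;
}

/* Six independent chains: six rows of A against one column of B.
   a points at the first of six consecutive rows, row_stride elements apart. */
void dot_six_rows(const double *a, size_t row_stride,
                  const double *b, size_t n, double out[6])
{
    assert(a != NULL && b != NULL && out != NULL);
    double acc[6] = { 0.0, 0.0, 0.0, 0.0, 0.0, 0.0 };
    for (size_t k = 0; k < n; ++k) {
        const double bk = b[k];             /* loaded once per column element */
        for (int r = 0; r < 6; ++r)         /* six additions with no mutual dependency */
            acc[r] += a[(size_t)r * row_stride + k] * bk;
    }
    for (int r = 0; r < 6; ++r)
        out[r] = acc[r];
}
```

Whether a given compiler maps the six accumulators onto x87 registers as in the listing is not guaranteed; the sketch only shows the dependency structure that makes such scheduling possible.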
Align and Pack DirectPath x87 Instructions

The last optimization to be performed is code packing and alignment. Having an abundance of 
operations in the decoder keeps the processor’s schedulers well fed in circumstances where 
instructions cannot be immediately provided to the decoders. Floating-point x87 code can be aligned 
to 8-byte boundaries as illustrated here, which is optimal on AMD Athlon, AMD Athlon 64, and 
AMD Opteron processors: 

; Instruction Address   Opcode     Instruction

00000360                66         DB 066h
00000361                DD 06      fld QWORD PTR [esi]
00000363                66         DB 066h
00000364                DD 07      fld QWORD PTR [edi]
00000366                D8 C9      fmul st(0), st(1)
00000368                DE C7      faddp st(7), st(0)
0000036A                DD 04 38   fld QWORD PTR [edi+eax]
0000036D                66         DB 066h
0000036E                D8 C9      fmul st(0), st(1)
00000370                DE C6      faddp st(6), st(0)
00000372                DD 04 47   fld QWORD PTR [edi+eax*2]
00000375                66         DB 066h
00000376                D8 C9      fmul st(0), st(1)
00000378                DE C5      faddp st(5), st(0)
0000037A                DD 04 3B   fld QWORD PTR [edi+ebx]
0000037D                66         DB 066h
0000037E                D8 C9      fmul st(0), st(1)
00000380                DE C4      faddp st(4), st(0)
00000382                DD 04 87   fld QWORD PTR [edi+eax*4]
00000385                66         DB 066h
00000386                D8 C9      fmul st(0), st(1)
00000388                DE C3      faddp st(3), st(0)
0000038A                DC 0C 39   fmul QWORD PTR [edi+ecx]
0000038D                66         DB 066h
0000038E                DE C1      faddp st(1), st(0)


The instruction address specifies the address (in hexadecimal) of the instruction to the right. 

Typically three DirectPath instructions occupy 7 bytes. Maintaining 8-byte alignment for the next
group of three instructions requires the addition of a single byte. A 1-byte padding can easily be
achieved using the single-byte NOP instruction (opcode 90h), as recommended in “Code Padding
with Operand-Size Override and NOP” on page 89. However, for the special case of x87 instructions,
the operand-size override (66h) serves as a high-performance NOP instruction and is the
recommended choice for padding an x87 instruction without altering its behavior, as shown here:

DB 066h ; Operand-size override used as high-performance NOP instruction 

This usage of the operand-size override alone as a filler byte (without an accompanying NOP 
instruction) is permitted only for x87 instructions. This usage of the operand-size override can be 
applied to all but four of the x87 instructions. The FLDENV, FRSTOR, FSTENV, and FSAVE 
instructions and their no-wait forms behave differently when associated with an operand-size 
override; therefore, these should not be padded with the operand-size override.

10.3 Floating-Point Compare Instructions

Optimization 

For branches that are dependent on floating-point comparisons, use the FCOMI, FCOMIP, FUCOMI,
and FUCOMIP instructions.

Application 

This optimization applies to: 

• 32-bit software 

• 64-bit software 

Rationale 

The FCOMI, FCOMIP, FUCOMI, and FUCOMIP instructions are much faster than the classical 
approach using FSTSW. When FSTSW cannot be avoided (for example, backward compatibility of 
code with older processors), no floating-point instruction should occur between an FCOM, FCOMP, 
FCOMPP, FICOM, FICOMP, FUCOM, FUCOMP, FUCOMPP, or FTST instruction and a dependent 
FSTSW instruction. This optimization allows the use of a fast-forwarding mechanism for the
floating-point condition codes internal to the processor’s floating-point unit and increases performance.

10.4 Using the FXCH Instruction Rather Than FST/FLD Pairs

Optimization 

Increase parallelism by breaking up dependency chains or by evaluating multiple dependency chains 
simultaneously by explicitly switching execution between them. 

Application 

This optimization applies to: 

• 32-bit software 

• 64-bit software 

Rationale 

Although the AMD Athlon 64 and AMD Opteron processor’s floating-point unit has a deep 
scheduler, which in most cases can extract sufficient parallelism from existing code, long dependency 
chains can stall the scheduler while issue slots are still available. The maximum dependency chain 
length that the scheduler can absorb is about six four-cycle instructions. 

To switch execution between dependency chains, use of the FXCH instruction is recommended 
because it has an apparent latency of zero cycles and generates only one micro-op. The floating-point 
unit of the AMD Athlon 64 and AMD Opteron processors contains special hardware to handle up to 
three FXCH instructions per cycle. Using FXCH is preferred over the use of FST/FLD pairs, even if 
the FST/FLD pair works on a register. An FST/FLD pair adds two cycles of latency and consists of 
two macro-ops. 
10.5 Floating-Point Subexpression Elimination 

Optimization 

Reduce the number of superfluous FXCH instructions by putting the shared source operand at the top 
of the stack to eliminate subexpressions. 

Application 

This optimization applies to: 

• 32-bit software 

• 64-bit software 

Rationale 

There are cases that do not require an FXCH instruction after every instruction to allow access to two 
new stack entries. In the cases where two instructions share a source operand, an FXCH is not 
required between the two instructions. When there is an opportunity for subexpression elimination, 
reduce the number of superfluous FXCH instructions by putting the shared source operand at the top 
of the stack—for example: 

Examples 

Listing 29. Avoid 

; func((x*y),(x+z)) 
fld   x             ; x 
fld   y             ; y    x 
fld   x             ; x    y    x 
fld   z             ; z    x    y    x 
faddp st(1), st     ; x+z  y    x 
fxch  st(2)         ; x    y    x+z 
fmulp st(1), st     ; x*y  x+z 

Listing 30. Preferred 

fld   z             ; z 
fld   y             ; y    z 
fld   x             ; x    y    z 
fmul  st(1), st     ; x    x*y  z 
faddp st(2), st     ; x*y  x+z 
10.6 Accumulating Precision-Sensitive Quantities in x87 Registers 

Optimization 

Accumulate results in the x87 registers rather than the SSE and SSE2 XMM registers, if more than 
64 bits of accuracy are required. 

Application 

This optimization applies to: 

• 32-bit software 

• 64-bit software 

Rationale 

More than 64 bits of accuracy may be required, as when accumulating a result (for example, during 
the calculation of dot product). The precision of floating-point operations in the x87 registers ST(0)- 
ST(7) is 80 bits internally, whereas the precision of operations using SIMD instructions is only 
64 bits. 
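
As a minimal sketch of the dot-product case mentioned above, the following C functions accumulate the same products in a double and in a long double. The function names are illustrative and are not part of this guide. The sketch assumes a compiler that maps long double to the 80-bit x87 format (common for GCC and Clang on x86 Linux targets); some compilers treat long double as a 64-bit double, in which case the two functions behave identically.

```c
#include <assert.h>
#include <stddef.h>

/* Dot product accumulated in a 64-bit double. */
double dot_double(const double *a, const double *b, size_t n)
{
    double acc = 0.0;
    assert(a != NULL && b != NULL);
    for (size_t i = 0; i < n; i++)
        acc += a[i] * b[i];
    return acc;
}

/* Dot product accumulated in long double. Assumption: the compiler
   implements long double with the 80-bit x87 extended format. */
long double dot_extended(const double *a, const double *b, size_t n)
{
    long double acc = 0.0L;
    assert(a != NULL && b != NULL);
    for (size_t i = 0; i < n; i++)
        acc += (long double)a[i] * (long double)b[i];
    return acc;
}
```

Only the accumulator is kept in extended precision; the input arrays remain in double-precision format in memory.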
10.7 Avoiding Extended-Precision Data 

Optimization 

Store floating-point data in single-precision or double-precision format. 

Application 

This optimization applies to: 

• 32-bit software 

• 64-bit software 

Rationale 

Loading and storing extended-precision data is significantly slower than storing single- or 
double-precision data. 
Appendix A Microarchitecture for AMD Athlon™ 64 and AMD Opteron™ Processors 


When discussing processor design, it is important to understand the terms architecture, 
microarchitecture, and design implementation. 

The architecture consists of the instruction set and those features of a processor that are visible to 
software programs running on the processor. The architecture determines what software the processor 
can run. The AMD64 architecture of the AMD Athlon™ 64 and AMD Opteron™ processors is 
compatible with the industry-standard x86 instruction set. 

The term microarchitecture refers to the design features used to reach the target cost, performance, 
and functionality goals of the processor. The AMD64 architecture employs a decoupled 
decode/execution design approach. In other words, decoders and execution units essentially operate 
independently; the execution core uses a small number of instructions and simplified circuit design 
for fast single-cycle execution and fast operating frequencies. 

The design implementation refers to a particular combination of physical logic and circuit elements 
that comprise a processor that meets the microarchitecture specifications. 

This appendix covers the following topics: 


Topic                                                               Page 
Key Microarchitecture Features                                      250 
Microarchitecture for AMD Athlon™ 64 and AMD Opteron™ Processors    251 
Superscalar Processor                                               251 
Processor Block Diagram                                             251 
L1 Instruction Cache                                                252 
Branch-Prediction Table                                             253 
Fetch-Decode Unit                                                   254 
Instruction Control Unit                                            254 
Translation-Lookaside Buffer                                        254 
L1 Data Cache                                                       255 
Integer Scheduler                                                   256 
Integer Execution Unit                                              256 
Floating-Point Scheduler                                            257 
Floating-Point Execution Unit                                       258 
Load-Store Unit                                                     258 
L2 Cache                                                            259 
Write-combining                                                     260 
Buses for AMD Athlon™ 64 and AMD Opteron™ Processor                 260 
Integrated Memory Controller                                        260 
HyperTransport™ Technology Interface                                260 


A.1 Key Microarchitecture Features 

The AMD Athlon 64 and AMD Opteron processors include many features designed to improve 
software performance. The internal design, or microarchitecture, of these processors provides the 
following key features: 

• Integrated DDR memory controller 

• 64-Kbyte L1 instruction cache and 64-Kbyte L1 data cache 

• On-chip L2 cache 

• Instruction predecode and branch prediction during cache-line fills 

• Decoupled decode/execution core 

• Three-way AMD64 instruction decoding 

• Dynamic scheduling and speculative execution 

• Three-way integer execution 

• Three-way address generation 

• Three-way floating-point execution 

• 3DNow!™ technology, MMX™, SSE, and SSE2 single-instruction multiple-data (SIMD) 
instruction extensions 

• Superforwarding 

• Deep out-of-order integer and floating-point execution 

• In 64-bit mode, eight additional XMM registers (for use with SSE and SSE2 instructions) and 
eight additional general-purpose registers (GPRs) 

• HyperTransport™ technology 
A.2 Microarchitecture for AMD Athlon™ 64 and AMD Opteron™ Processors 

The AMD Athlon 64 and AMD Opteron processors implement the AMD64 instruction set by means 
of micro-ops: simple fixed-length operations designed to include direct support for AMD64 
instructions and adhere to the high-performance principles of fixed-length encoding, regularized 
instruction fields, and a large register set. The enhanced microarchitecture enables higher processor 
core performance and promotes straightforward extensibility for future designs. 

A.3 Superscalar Processor 

The AMD Athlon 64 and AMD Opteron processors are aggressive, out-of-order, three-way 
superscalar AMD64 processors. They can fetch, decode, and issue up to three AMD64 instructions 
per cycle with a centralized instruction control unit (ICU) and two independent instruction 
schedulers—an integer scheduler and a floating-point scheduler. These two schedulers can 
simultaneously issue up to nine micro-ops to the three general-purpose integer execution units 
(ALUs), three address-generation units (AGUs), and three floating-point execution units. The 
processors move integer instructions down the integer execution pipeline, which consists of the 
integer scheduler and the ALUs, as shown in Figure 6 on page 252. Floating-point instructions are 
handled by the floating-point execution pipeline, which consists of the floating-point scheduler and 
the floating-point execution units. 

A.4 Processor Block Diagram 

A block diagram of the AMD Athlon 64 and AMD Opteron processors is shown in Figure 6 on 
page 252. 
Figure 6. AMD Athlon™ 64 and AMD Opteron™ Processors Block Diagram 

A.5 L1 Instruction Cache 

The out-of-order execution engine of the AMD Athlon 64 and AMD Opteron processors contains a 
very large L1 instruction cache. Each line in this cache is 64 bytes long. Functions associated with the 
L1 instruction cache are instruction loads, instruction prefetching, instruction predecoding, and 
branch prediction. Requests that miss in the L1 instruction cache are fetched from the L2 cache or, 
subsequently, from the local memory using the integrated memory controller. 

The L1 instruction cache generates fetches on the naturally aligned 64 bytes containing the 
instructions and the next sequential line of 64 bytes (a prefetch). The principle of program-spatial 
locality makes code prefetching very effective and avoids or reduces execution stalls caused by the 
amount of time required to read the necessary code. Cache-line replacement is based on a 
least-recently-used replacement algorithm. 
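
The fetch behavior described above reduces to simple address arithmetic. The following C sketch computes the naturally aligned 64-byte line that contains a code address, the next sequential line that is prefetched, and the number of lines a block of code touches. The function names are hypothetical helpers for illustration, not part of any AMD interface.

```c
#include <assert.h>
#include <stdint.h>

#define ICACHE_LINE_BYTES 64u  /* line size stated in this section */

/* Base address of the naturally aligned 64-byte line containing addr. */
uint64_t icache_line_base(uint64_t addr)
{
    return addr & ~(uint64_t)(ICACHE_LINE_BYTES - 1);
}

/* Base address of the next sequential line (the prefetched line). */
uint64_t icache_prefetch_line(uint64_t addr)
{
    return icache_line_base(addr) + ICACHE_LINE_BYTES;
}

/* Number of 64-byte lines touched by len bytes of code starting at start. */
uint64_t icache_lines_spanned(uint64_t start, uint64_t len)
{
    assert(len > 0);
    uint64_t first = icache_line_base(start);
    uint64_t last = icache_line_base(start + len - 1);
    return (last - first) / ICACHE_LINE_BYTES + 1;
}
```

For example, a 32-byte loop body that starts at offset 30h within a line spans two lines, whereas the same loop aligned to the start of a line spans one.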
Table 7 provides specifications on the L1 instruction cache for various AMD processors. 

Table 7. L1 Instruction Cache Specifications by Processor 

Processor name              Family   Model   Associativity   Size (Kbytes) 
AMD Athlon™ XP processor    6        6       2 ways          64 
AMD Athlon™ 64 processor    15       All     2 ways          64 
AMD Opteron™ processor      15       All     2 ways          64 


Predecoding begins as the L1 instruction cache is filled. Predecode information is generated and 
stored alongside the instruction cache. This information is used to help efficiently identify the 
boundaries between variable-length AMD64 instructions. 

A.6 Branch-Prediction Table 

The AMD Athlon 64 and AMD Opteron processors assume that a branch is not taken until it is taken 
once. Then it is assumed that the branch is taken, until it is not taken. Thereafter, the branch 
prediction table is used. 

The fetch logic accesses the branch prediction table in parallel with the L1 instruction cache. The 
information stored in the branch prediction table is used to predict the direction of branch 
instructions. When instruction cache lines are evicted to the L2 cache, branch selectors and predecode 
information are also stored in the L2 cache. 

The AMD Athlon 64 and AMD Opteron processors employ combinations of a branch target address 
buffer (BTB), a global history bimodal counter (GHBC) table, and a return address stack (RAS) to 
predict and accelerate branches. Predicted-taken branches incur only a single-cycle delay to redirect 
the instruction fetcher to the target instruction. In the event of a misprediction, the minimum penalty 
is 10 cycles. 

The BTB is a 2048-entry table that caches in each entry the predicted target address of a branch. The 
16384-entry GHBC table contains 2-bit saturating counters used to predict whether a conditional 
branch is taken. The GHBC table is indexed using the outcome (taken or not taken) of the last eight 
conditional branches and 4 bits of the branch address. The GHBC table allows the processors to 
predict branch patterns of up to eight branches. 
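
The following C sketch is a toy software model of the scheme just described: 2-bit saturating counters indexed by the outcomes of the last eight conditional branches combined with 4 bits of the branch address. It is not a description of the hardware. Which address bits are used, how the index bits are combined, the initial counter values, and the 4096-entry table implied by a 12-bit index (rather than the full 16384-entry table) are all assumptions made for illustration, as are the function names.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define GHBC_HISTORY_BITS 8   /* last eight conditional-branch outcomes */
#define GHBC_ADDR_BITS    4   /* 4 bits of the branch address */
#define GHBC_ENTRIES      (1u << (GHBC_HISTORY_BITS + GHBC_ADDR_BITS))

typedef struct {
    uint8_t counters[GHBC_ENTRIES]; /* 2-bit saturating counters, 0..3 */
    uint8_t history;                /* 1 = taken, newest outcome in bit 0 */
} ghbc_model;

void ghbc_init(ghbc_model *m)
{
    assert(m != NULL);
    memset(m, 0, sizeof *m);        /* assumption: start strongly not-taken */
}

unsigned ghbc_index(const ghbc_model *m, uint64_t branch_addr)
{
    /* Assumption: the low 4 address bits are the ones used. */
    unsigned addr_bits = (unsigned)(branch_addr & ((1u << GHBC_ADDR_BITS) - 1));
    return ((unsigned)m->history << GHBC_ADDR_BITS) | addr_bits;
}

/* Returns 1 if the model predicts taken, 0 otherwise. */
int ghbc_predict(const ghbc_model *m, uint64_t branch_addr)
{
    return m->counters[ghbc_index(m, branch_addr)] >= 2;
}

/* Trains the counter for this branch, then records the outcome in the history. */
void ghbc_update(ghbc_model *m, uint64_t branch_addr, int taken)
{
    uint8_t *c = &m->counters[ghbc_index(m, branch_addr)];
    if (taken) {
        if (*c < 3) (*c)++;
    } else {
        if (*c > 0) (*c)--;
    }
    m->history = (uint8_t)((m->history << 1) | (taken ? 1 : 0));
}

/* Replays a repeating taken/not-taken pattern (bit i of pattern is outcome i)
   for a single branch and returns the mispredictions counted after warmup. */
unsigned ghbc_replay(uint32_t pattern, unsigned period,
                     unsigned warmup, unsigned measured)
{
    ghbc_model m;
    unsigned misses = 0;
    if (period == 0 || period > 32)
        return 0;
    ghbc_init(&m);
    for (unsigned i = 0; i < warmup + measured; i++) {
        int taken = (int)((pattern >> (i % period)) & 1u);
        if (i >= warmup && ghbc_predict(&m, 0x401000) != taken)
            misses++;
        ghbc_update(&m, 0x401000, taken);
    }
    return misses;
}
```

In this model, a branch that is always taken or that strictly alternates is predicted without error once the counters are trained. A loop branch that is taken nine times and then falls through still mispredicts, because the eight-outcome history looks identical before the ninth taken iteration and before the exit.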

In addition, the processors implement a 12-entry return address stack to predict return addresses from 
a near or far call. As calls are fetched, the next rIP is pushed onto the return stack. Subsequent returns 
pop a predicted return address off the top of the stack. 
A.7 Fetch-Decode Unit 

The fetch-decode unit performs early decoding of AMD64 instructions into macro-ops. The outputs 
of the early decoders keep all (DirectPath or VectorPath) instructions in program order. Early 
decoding produces three macro-ops per cycle from either path. The outputs of both decoders are 
multiplexed together and passed to the next stage in the pipeline, the instruction control unit. 
Decoding a VectorPath instruction may prevent simultaneous decoding of a DirectPath instruction. 

When the target 16-byte instruction window is obtained from the L1 instruction cache, the instruction 
bytes are examined to determine whether the type of basic decode to occur is DirectPath or 
VectorPath. 


A.8 Instruction Control Unit 

The instruction control unit (ICU) is the control center for the AMD Athlon 64 and AMD Opteron 
processors. It controls the centralized in-flight reorder buffer, the integer scheduler, and the 
floating-point scheduler. In turn, the ICU is responsible for the following functions: macro-op dispatch, 
macro-op retirement, register and flag dependency resolution and renaming, execution resource 
management, interrupts, exceptions, and branch mispredictions. 

The instruction control unit takes the three macro-ops per cycle from the early decoders and places 
them in a centralized, fixed-issue reorder buffer. This buffer is organized into 24 lines of three 
macro-ops each. The reorder buffer allows the instruction control unit to track and monitor up to 72 in-flight 
macro-ops (whether integer or floating-point) for maximum instruction throughput. The instruction 
control unit can simultaneously dispatch multiple macro-ops from the reorder buffer to both the 
integer and floating-point schedulers for final decode, issue, and execution as micro-ops. In addition, 
the instruction control unit handles exceptions and manages the retirement of macro-ops. 

A.9 Translation-Lookaside Buffer 

A translation-lookaside buffer (TLB) is a special on-chip cache that holds a table that matches the 
most-recently-used virtual addresses to their physical addresses. 

The AMD Athlon 64 and AMD Opteron processors utilize a two-level TLB structure. A flush filter— 
new on the AMD Athlon 64 and AMD Opteron processors—eliminates unnecessary TLB flushes 
when loading the CR3 register. 

L1 Instruction TLB Specifications 

Table 8 provides the specifications of the L1 instruction TLB for various AMD processors. 

Table 8. L1 Instruction TLB Specifications 

                                                             Number of Entries 
Processor Name               Family   Model   Associativity   2-Mbyte Pages (1)   4-Kbyte Pages 
AMD Athlon™ XP Processor     6        6       Full            8                   16 
AMD Athlon™ 64 Processor     15       All     Full            8                   32 
AMD Opteron™ Processor       15       All     Full            8                   32 

Note: 

1. The number of entries available for 4-Mbyte pages is one-half this value (4-Mbyte pages require two 2-Mbyte 
entries). 


L1 Data TLB Specifications 

Table 9 provides the specifications of the L1 data TLB for various AMD processors. 

Table 9. L1 Data TLB Specifications 

                                                             Number of Entries 
Processor Name               Family   Model   Associativity   2-Mbyte Pages (1)   4-Kbyte Pages 
AMD Athlon™ XP Processor     6        6       Full            8                   32 
AMD Athlon™ 64 Processor     15       All     Full            8                   32 
AMD Opteron™ Processor       15       All     Full            8                   32 

Note: 

1. The number of entries available for 4-Mbyte pages is one-half this value (4-Mbyte pages require two 2-Mbyte 
entries). 


L2 TLB Specifications 

Table 10 provides the specifications on the L2 TLB for various AMD processors. 

Table 10. L2 TLB Specifications by Processor 

Processor Name               Family   Model   Associativity   Number of Entries (4-Kbyte Pages) 
AMD Athlon™ XP Processor     6        6       4 ways          256 
AMD Athlon™ 64 Processor     15       All     4 ways          512 
AMD Opteron™ Processor       15       All     4 ways          512 
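
One way to read the entry counts in these tables is as the amount of memory that each TLB level can map at once (often called TLB reach, a term this guide does not use). The following C sketch shows the arithmetic for the AMD Athlon 64 and AMD Opteron values listed above. The function name is illustrative only.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes of address space that a TLB with the given number of entries can map
   for one page size. Entry counts are taken from Tables 9 and 10. */
uint64_t tlb_reach_bytes(uint32_t entries, uint64_t page_bytes)
{
    assert(page_bytes != 0);
    return (uint64_t)entries * page_bytes;
}
```

With 4-Kbyte pages, the 32-entry L1 data TLB maps 128 Kbytes and the 512-entry L2 TLB maps 2 Mbytes. The eight 2-Mbyte entries of the L1 data TLB map 16 Mbytes.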


A.10 L1 Data Cache 

The L1 data cache contains two 64-bit ports. It is a write-allocate and writeback cache that uses a 
least-recently-used replacement policy. It is divided into eight banks, each eight bytes wide. In 
addition, the L1 cache supports the MOESI (Modified, Owner, Exclusive, Shared, and Invalid) 
cache-coherency protocol and data parity. 
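
Eight banks of eight bytes each cover exactly one 64-byte cache line. The following C sketch shows how an address could be mapped to a bank under the assumption that consecutive 8-byte chunks of a line fall in consecutive banks, so that address bits 5:3 select the bank. That mapping and the function names are assumptions for illustration; this section does not specify the mapping.

```c
#include <assert.h>
#include <stdint.h>

#define DCACHE_BANKS      8u  /* banks in the L1 data cache */
#define DCACHE_BANK_BYTES 8u  /* width of each bank in bytes */

/* Bank index for a data address. Assumption: bits 5:3 select the bank. */
unsigned dcache_bank(uint64_t addr)
{
    /* Eight 8-byte banks span exactly one 64-byte cache line. */
    assert(DCACHE_BANKS * DCACHE_BANK_BYTES == 64u);
    return (unsigned)((addr / DCACHE_BANK_BYTES) % DCACHE_BANKS);
}

/* Returns 1 if two addresses map to the same bank under this assumption. */
int dcache_same_bank(uint64_t a, uint64_t b)
{
    return dcache_bank(a) == dcache_bank(b);
}
```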
Table 11 provides specifications on the L1 data cache for various AMD processors. 

Table 11. L1 Data Cache Specifications by Processor 

Processor name              Family   Model   Associativity   Size (Kbytes) 
AMD Athlon™ XP Processor    6        6       2 ways          64 
AMD Athlon™ 64 Processor    15       All     2 ways          64 
AMD Opteron™ Processor      15       All     2 ways          64 


A.11 Integer Scheduler 

The integer scheduler is based on a three-wide queuing system (also known as a reservation station) 
that feeds three integer execution positions or pipes. The reservation stations are eight entries deep, 
for a total queuing system of 24 integer macro-ops. Each reservation station divides the macro-ops 
into integer and address generation micro-ops, as required. 

A.12 Integer Execution Unit 

The integer execution pipeline consists of three identical pipes—0, 1, and 2. Each integer pipe 
consists of an integer execution unit—or arithmetic-logic unit (ALU)—and an address generation unit 
(AGU). The integer execution pipeline is organized to match the three macro-op dispatch pipes in the 
ICU as shown in Figure 7. 


Figure 7. Integer Execution Pipeline (diagram labels: Macro-ops, Micro-ops) 


Macro-ops are broken down into micro-ops in the schedulers. Micro-ops are executed when their 
operands are available, either from the register file or result buses. Micro-ops from a single operation 
can execute out-of-order. In addition, a particular integer pipe can execute two micro-ops from 
different macro-ops (one in the ALU and one in the AGU) at the same time. See Figure 7 on 
page 256. 

Each of the three ALUs performs general purpose logic functions, arithmetic functions, conditional 
functions, divide step functions, status flag multiplexing, and branch resolutions. The AGUs calculate 
the logical addresses for loads, stores, and LEAs. A load and store unit reads and writes data to and 
from the L1 data cache. The integer scheduler sends a completion status to the ICU when the 
outstanding micro-ops for a given macro-op are executed. 

All integer operations can be handled within any of the three ALUs with the exception of multiplies. 
Multiplies are handled by a pipelined multiplier that is attached to the pipeline at pipe 0, as shown in 
Figure 7. Multiplies always issue to integer pipe 0, and the issue logic creates results bus bubbles for 
the multiplier in integer pipes 0 and 1 by preventing non-multiply micro-ops from issuing at the 
appropriate time. 

A.13 Floating-Point Scheduler 

The floating-point logic of the AMD Athlon 64 and AMD Opteron processors is a high-performance, 
fully pipelined, superscalar, out-of-order execution unit. It is capable of accepting three macro-ops 
per cycle from any mixture of the following types of instructions: 

• x87 floating-point 

• 3DNow! technology 

• MMX technology 

• SSE 

• SSE2 

The floating-point scheduler handles register renaming and has a dedicated 36-entry scheduler buffer 
organized as 12 lines of three macro-ops each. It also performs data superforwarding, micro-op issue, 
and out-of-order execution. The floating-point scheduler communicates with the ICU to retire a 
macro-op, to manage comparison results from the FCOMI instruction, and to back out results from a 
branch misprediction. 

Superforwarding is a performance optimization. It allows a floating-point operation having a 
dependency on a register to be scheduled sooner when that register is waiting to be filled by a pure 
load from memory. Instead of waiting for the first instruction to write its load-data to the register and 
then waiting for the second instruction to read it, the load-data can be provided directly to the 
dependent instruction, much like regular forwarding between FPU-only operations. The result from 
the load is said to be "superforwarded" to the floating-point operation. In the following example, the 
FADD can be scheduled to execute as soon as the load operation fetches its data rather than having to 
wait and read it out of the register file. 

fld  [somefloat]     ;Load a floating-point value 
                     ;from memory into ST(0) 
fadd st(0),st(1)     ;The data from the load will be 
                     ;forwarded directly to this instruction, 
                     ;no need to read the register file 

A.14 Floating-Point Execution Unit 

The floating-point execution unit (FPU) is implemented as a coprocessor having its own out-of-order 
control in addition to the data path. The FPU handles all register operations for x87 instructions, all 
3DNow! technology operations, all MMX operations, and all SSE and SSE2 operations. The FPU 
consists of a stack renaming unit, a register renaming unit, a scheduler, a register file, and three 
parallel execution units. Figure 8 shows a block diagram of the dataflow through the FPU. 



Figure 8. Floating-Point Unit 


As shown in Figure 8, the floating-point logic uses three separate execution positions or pipes. The 
first of the three pipes is generally known as the adder pipe (FADD), and it contains an MMX 
ALU/shifter and floating-point add execution units. The second pipe is known as the multiplier 
(FMUL). It contains the floating-point multiplier/divider/square root unit and also an MMX ALU. 
The third pipe is known as the floating-point load/store (FSTORE), which handles floating-point 
stores and many micro-op primitives used in VectorPath sequences. 


A.15 Load-Store Unit 

The load-store unit (LSU) is shown in Figure 9. It manages data load and store accesses to the L1 data 
cache and, if required, to the L2 cache or system memory. The 44-entry LSU provides a data interface 
for both the integer scheduler and the floating-point scheduler. It consists of two queues: a 12-entry 
queue for L1 cache load and store accesses and a 32-entry queue for L2 cache or system memory load 
and store accesses. The 12-entry queue can request a maximum of two L1 cache operations (any mix 
of loads and stores) per cycle. Up to two 64-bit stores can be performed per cycle. In other words, 
16 bytes per clock is the maximum rate at which the processor can move data. The 32-entry queue 
effectively holds requests that missed in the L1 cache probe by the 12-entry queue. Finally, the LSU 
helps ensure that the architectural load and store ordering rules are preserved (a requirement for 
AMD64 architecture compatibility). 



Figure 9. Load-Store Unit (diagram labels: Data Cache, 2-Way, 64 Kbytes; Store Data to BIU) 


A.16 L2 Cache 

The AMD Athlon 64 and AMD Opteron processors each contain an integrated L2 cache. This 
full-speed on-die L2 cache features an exclusive cache architecture. The L2 cache contains only victim or 
copy-back cache blocks that are to be written back to the memory subsystem as a result of a conflict 
miss. These terms, victim or copy-back, refer to cache blocks that were previously held in the L1 
cache but had to be overwritten (evicted) to make room for newer data. The victim buffer contains 
data evicted from the L1 cache. 

The L2 cache in the AMD Athlon XP, AMD Athlon™ 64, and AMD Opteron processors is 16-way 
associative. 


A.17 Write-combining 

See Appendix B, “Implementation of Write-Combining,” on page 263 for detailed information about 
write-combining. 

A.18 Buses for AMD Athlon™ 64 and AMD Opteron™ Processor 

AMD Athlon 64 and AMD Opteron processors feature an integrated memory controller and 
HyperTransport technology for interfacing to I/O devices. These integrated features, along with other 
logic, bring the Northbridge functionality onto the processor. 

A.19 Integrated Memory Controller 

AMD Athlon 64 and AMD Opteron processors provide an integrated low-latency, high-bandwidth 
DDR memory controller. 

The memory controller supports: 

• DRAM devices that are 4, 8, and 16 bits wide. 

• Interleaving memory within DIMMs. 

• ECC checking with double-bit detection and single-bit correction. 

For specifications on a certain processor’s memory controller, see the data sheet for that processor. 
For information on how to program the memory controller, see the BIOS and Kernel Developer's 
Guide for AMD Athlon™ 64 and AMD Opteron™ Processors, order# 26094. 

A.20 HyperTransport™ Technology Interface 

HyperTransport technology is a scalable, high-speed, low-latency, point-to-point, packetized link 
that: 

• Enables data transfer rates of up to 8 Gbytes/s (4 Gbytes/s in each direction simultaneously with a 
16-bit link). 

• Simplifies connectivity by replacing legacy buses and bridges. 

• Reduces latencies and bottlenecks within systems. 
When compared with traditional technologies, HyperTransport technology allows much faster 
data-transfer rates. For more information on HyperTransport technology, see the HyperTransport I/O Link 
Specification, available at www.hypertransport.org. 

HyperTransport™ Technology 

On AMD Athlon 64 and AMD Opteron processors, HyperTransport technology provides the link to 
I/O devices. Some processor models—for example, those designed for use in multiprocessing 
systems—also utilize HyperTransport technology to connect to other processors. See the BIOS and 
Kernel Developer's Guide for your particular processor for details concerning the HyperTransport 
technology implementation. 
Appendix B Implementation of Write-Combining 


This appendix describes the memory write-combining feature implemented in the AMD Athlon™ 64 
and AMD Opteron™ processors. Write-combining is the merging of multiple memory write cycles 
that target locations within the address range of a write buffer. 

The AMD Athlon 64 and AMD Opteron processors support the memory type range register (MTRR) 
and the page attribute table (PAT) extensions, which allow software to define ranges of memory as 
either writeback (WB), write-protected (WP), writethrough (WT), uncacheable (UC), or 
write-combining (WC). 

Defining the memory type for a range of memory as WC or WT allows the processor to conditionally 
combine data from multiple write cycles that are addressed within this range into a merge buffer. 
Merging multiple write cycles into a single write cycle reduces processor bus utilization and 
processor stalls. Write-combining buffers are also used for streaming store instructions such as 
MOVNTQ and MOVNTI. See “Streaming-Store/Non-Temporal Instructions” on page 112. 

This appendix covers the following topics: 


Topic                                                                     Page 
Write-Combining Definitions and Abbreviations                             263 
Programming Details                                                       264 
Write-combining Operations                                                264 
Sending Write-Buffer Data to the System                                   266 
Write-Combining Optimization on Revision D and E AMD Athlon™ 64 and 
AMD Opteron™ Processors                                                   266 


B.1 Write-Combining Definitions and Abbreviations 

This appendix uses the following definitions and abbreviations: 

• MTRR—Memory type range register 

• PAT—Page attribute table 

• UC—Uncacheable memory type 

• WC—Write-combining memory type 

• WT—Writethrough memory type 

• WP—Write-protected memory type 
• WB—Writeback memory type 

• One Byte—8 bits 

• One Word—16 bits 

• Doubleword—32 bits 

• Quadword—64 bits or 2 doublewords 

• Cache Block—64 bytes or 4 octawords or 8 quadwords 

B.2 Programming Details 

The following steps are required for programming write-combining on the AMD Athlon 64 and 
AMD Opteron processors: 

1. Verify the presence of an AMD Athlon™ 64 or AMD Opteron processor by using the CPUID 
instruction to check for the instruction family code and vendor identification of the processor. 
Standard function 0 on AMD processors returns a vendor identification string of 
“AuthenticAMD” in registers EBX, EDX, and ECX. Standard function 1 returns the processor 
signature in register EAX, where EAX[11:8] contains the instruction family code. For the 
AMD Athlon 64 and AMD Opteron processors, the instruction family code is Fh. 

2. Verify the presence of the MTRRs and the PAT extensions. The presence of the MTRRs is 
indicated by bit 12 and the presence of the PAT extensions is indicated by bit 16 of the extended 
features bits returned in the EDX register by CPUID function 8000_0001h. See the CPUID 
Specification, order# 25481, for more details on the CPUID instruction. 

3. Enable write-combining. Write-combining is controlled by the MTRRs and PAT extensions. 
Write-combining should be enabled for the appropriate memory ranges. For more information on 
the MTRRs and the PAT extensions, see volume 2 of the AMD64 Architecture Programmer's 
Manual, order# 24593. 
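
The checks in steps 1 and 2 can be sketched in C as pure functions that decode register values already obtained from the CPUID instruction. How CPUID is executed (compiler intrinsic or inline assembly) is left out on purpose, and the function names below are hypothetical helpers, not an AMD-provided interface. The bit positions and the vendor string are the ones given in the steps above.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Builds the 12-character vendor string from the registers returned by
   CPUID standard function 0, in the order EBX, EDX, ECX. */
void cpuid_vendor_string(uint32_t ebx, uint32_t edx, uint32_t ecx, char out[13])
{
    const uint32_t regs[3] = { ebx, edx, ecx };
    assert(out != NULL);
    for (int r = 0; r < 3; r++)
        for (int b = 0; b < 4; b++)
            out[4 * r + b] = (char)((regs[r] >> (8 * b)) & 0xFFu);
    out[12] = '\0';
}

/* Step 1: vendor identification check. */
int cpuid_is_authentic_amd(uint32_t ebx, uint32_t edx, uint32_t ecx)
{
    char vendor[13];
    cpuid_vendor_string(ebx, edx, ecx, vendor);
    return strcmp(vendor, "AuthenticAMD") == 0;
}

/* Step 1: instruction family code, EAX[11:8] of standard function 1. */
unsigned cpuid_family_code(uint32_t eax)
{
    return (eax >> 8) & 0xFu;
}

/* Step 2: feature bits in EDX returned by function 8000_0001h. */
int cpuid_has_mtrr(uint32_t edx) { return (int)((edx >> 12) & 1u); }
int cpuid_has_pat(uint32_t edx)  { return (int)((edx >> 16) & 1u); }
```

A caller would proceed to step 3 only if the vendor check succeeds, the family code is Fh, and both feature bits are set.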

B.3 Write-combining Operations 

To improve system performance, the AMD Athlon 64 and AMD Opteron processors aggressively 
combine multiple memory-write cycles of any data size that address locations within a 64-byte write 
buffer that is aligned to a cache-line boundary. The processor continues to combine writes to this 
buffer without writing the data to the system, as long as certain rules apply (see Table 12 on page 265 
for more information). The data sizes can be bytes, words, doublewords, or quadwords. 

• WC memory type writes can be combined in any order up to a full 64-byte write buffer. 

• WT memory type writes can only be combined up to a fully aligned quadword in the 64-byte 
buffer, and must be combined contiguously in ascending order. Combining may be opened at any 
byte boundary in a quadword, but is closed by a write that is either not “contiguous and 
ascending” or fills byte 7. 

• All other memory types for stores that go through the write buffer (UC and WP) cannot be 
combined except when the WB memory type is overridden for streaming store instructions such 
as the MOVNTQ and MOVNTI instructions. These instructions use the write buffers and are 
write-combined in the same way as address spaces mapped by the MTRRs and PAT 
extensions. When WC is used for streaming store instructions, the buffers are subject to the 
same flushing events as write-combined address spaces. 

Combining is able to continue until interrupted by one of the conditions listed in Table 12 on 
page 265. When combining is interrupted, one or more bus commands are issued to the system for 
that write buffer, as described in “Sending Write-Buffer Data to the System” on page 266. 
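
The addressing rules above can be expressed as small predicates. The following C sketch is a simplified software model, not a description of the hardware: it assumes writes of 1 to 8 bytes and ignores the other completion events in Table 12. The function names are illustrative only.

```c
#include <assert.h>
#include <stdint.h>

#define WRITE_BUFFER_BYTES 64u  /* write buffer aligned to a cache line */

/* WC rule: two writes can share a buffer only if they fall within the same
   aligned 64-byte block. Order within the block does not matter. */
int wc_same_buffer(uint64_t a, uint64_t b)
{
    return (a / WRITE_BUFFER_BYTES) == (b / WRITE_BUFFER_BYTES);
}

/* WT rule: returns 1 if a write of size bytes at addr fills byte 7 of a
   quadword, which closes combining. A write that is misaligned across a
   quadword boundary also fills byte 7. */
int wt_fills_byte7(uint64_t addr, unsigned size)
{
    assert(size >= 1 && size <= 8);
    return ((addr & 7u) + size) >= 8u;
}

/* WT rule: returns 1 if a new write at addr can combine with the previous
   write. It must be contiguous and ascending, and the previous write must
   not have filled byte 7. */
int wt_can_combine(uint64_t prev_addr, unsigned prev_size, uint64_t addr)
{
    if (wt_fills_byte7(prev_addr, prev_size))
        return 0;
    return addr == prev_addr + prev_size;
}
```

For example, two word writes to offsets 0 and 2 of a quadword can combine under WT, but a write to offset 4 after a write to offset 0 cannot, because it leaves a gap.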


Table 12. Write-Combining Completion Events 

Event                      Comment 

Non-WB write outside of    (On revisions A-C processors only) The first non-WB write to a 
current buffer             different cache block address closes combining for previous writes. 
                           WB writes do not affect write-combining. Only one line-sized buffer 
                           can be open for write-combining at a time. Once a buffer is closed for 
                           write-combining, it cannot be reopened for write-combining. 

I/O Read or Write          Any IN/INS or OUT/OUTS instruction closes combining. The implied 
                           memory type for all IN/OUT instructions is UC, which cannot be 
                           combined. 

Serializing instructions   Any serializing instruction closes combining. These instructions 
                           include: MOVCRx, MOVDRx, WRMSR, INVD, INVLPG, WBINVD, 
                           LGDT, LLDT, LIDT, LTR, CPUID, IRET, RSM, INIT, and HALT. 

Flushing instructions      Any flush instruction causes the WC to complete. 

Locks                      Any instruction or processor operation that requires a cache or bus 
                           lock closes write-combining before starting the lock. Writes within a 
                           lock can be combined. 

Uncacheable Read           A UC read closes write-combining. A WC read closes combining 
                           only if a cache block address match occurs between the WC read 
                           and a write in the write buffer. 

Different memory type      Any WT write while write-combining for WC memory or any WC write 
                           while write-combining for WT memory closes write-combining. 

Buffer full                Write-combining is closed if all 64 bytes of the write buffer are valid. 

WT time-out                If 16 processor clocks have passed since the most recent write for 
                           WT write-combining, write-combining is closed. There is no time-out 
                           for WC write-combining. 

WT write fills byte 7      Write-combining is closed if a write fills the most significant byte of a 
                           quadword, which includes writes that are misaligned across a 
                           quadword boundary. In the misaligned case, combining is closed by 
                           the LS part of the misaligned write and combining is opened by the 
                           MS part of the misaligned store. 

WT Nonsequential           If a subsequent WT write is not in ascending sequential order, the 
                           write-combining completes. WC writes have no addressing 
                           constraints within the 64-byte line being combined. 

TLB AD bit set             Write-combining is closed whenever a TLB reload sets the accessed 
                           [A] or dirty [D] bits of a PDE or PTE. 


B.4 Sending Write-Buffer Data to the System 

The maximum write combined throughput is achieved when all quadwords or doublewords are valid 
and the AMD Athlon 64 and AMD Opteron processors can use one efficient 64-byte memory write 
instead of multiple 8-byte memory writes. 

B.5 Write-Combining Optimization on Revision D and E AMD Athlon™ 64 and AMD Opteron™ Processors

The number of Write Combining buffers on revision D and revision E AMD Athlon 64 and AMD 
Opteron processors has changed from earlier CPU revisions. Although the number of buffers 
available for write combining depends on the specific CPU revision, current designs provide as many 
as four write buffers for WC memory mapped I/O address spaces. These same buffers are used for 
streaming store instructions. The number of write-buffers determines how many independent linear 
64-byte streams of WC data the CPU can simultaneously buffer. 

Having multiple write-combining buffers that can combine independent WC streams has implications 
on data throughput rates (bandwidth), especially when data is written by the CPU to WC memory 
mapped I/O devices, residing on the AGP, PCI, PCI-X and PCI-E busses including: 

• Memory Mapped I/O registers—command FIFO, etc. 

• Memory Mapped I/O apertures—windows to which the CPU uses programmed I/O to send data to
a hardware device 

• Sequential block of 2D/3D graphic engine registers written using programmed I/O 

• Video memory residing on the graphics accelerator—frame buffer, render buffers, textures, etc. 

HyperTransport tunnels are HyperTransport-to-bus bridges. There are tunnels for AGP, PCI Express, 
PCI and PCI-X. Examples of tunnels are the AMD-8151™ graphics tunnel, the AMD-8131™ I/O
bus tunnel, and the AMD-8132™ PCI-X tunnel. Many HyperTransport tunnels use a hardware 
optimization feature called write-chaining. In write-chaining, the tunnel device buffers and combines 
separate HyperTransport packets of data sent by the CPU, creating one large burst on the underlying 
bus when the data is received by the tunnel in sequential address order. Using larger bursts results in 
better throughput since bus efficiency is increased. This is because bus arbitration overhead is lower:
only one address/attribute phase is issued per burst in the PCI-X case, and one address/command 
phase is issued for the AGP Fast Writes case. An illustration of address phase overhead on AGP Fast 
Writes is provided in Figure 10 on page 346 in Appendix D, AGP Considerations. 

For reasons cited in the preceding paragraph, to utilize hardware write chaining efficiently, software
should flush the CPU write-combining buffer in sequential linear address order, any time a target 
hardware device is capable of receiving large bursts of CPU write data. 

Software should be aware that on AMD64 processors that have multiple write-combining buffers (that
is, revision D and E processors), events that flush the write-combining buffers (see Appendix B, Table 8)
send out the 64-byte WC buffers in the order that the streams were opened. This means that if the
CPU writes to the WC space in the highest 64-byte addressed buffer first (for example, address 40h),
and then writes to a lower 64-byte buffer next (for example, address 00h), when those buffers are sent
by the CPU (by HyperTransport to the tunnel), the highest address 64-byte buffer is sent first,
followed by the second (lower address) 64-byte buffer. Since the addressing is not sequential, the
tunnel device will not "chain" both 64-byte WC buffers and must issue two separate transactions on the
target bus.

If the above example were targeted for AGP Fast Writes, issuing two Fast Write transactions (rather
than issuing one Fast Write transaction) reduces the bandwidth (data throughput) by 1/3. See
Figure 10 on page 346 in Appendix D.
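
The effect of buffer ordering on chaining can be illustrated with a small C model. This is only a sketch under simplifying assumptions: the function name count_bus_transactions is hypothetical, and the rule that two buffers chain only when the second starts exactly 64 bytes above the first is a simplification of the sequential-address requirement described above, not a description of any particular tunnel device.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WC_LINE_BYTES 64u

/* Hypothetical model: given the addresses of 64-byte WC buffers in the
 * order the CPU sends them, count the transactions a write-chaining
 * tunnel would issue on the target bus. A buffer is chained onto the
 * previous one only if it starts exactly 64 bytes above it. */
size_t count_bus_transactions(const uint64_t *line_addr, size_t n)
{
    size_t transactions = 0;

    for (size_t i = 0; i < n; i++) {
        assert(line_addr[i] % WC_LINE_BYTES == 0); /* buffers are line aligned */
        if (i == 0 || line_addr[i] != line_addr[i - 1] + WC_LINE_BYTES)
            transactions++; /* not sequential: a new burst starts here */
    }
    return transactions;
}
```

In this model, sending the buffers at 40h and then 00h yields two transactions, while sending 00h and then 40h yields one.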

Optimizations 

Adhere to the following guidelines to ensure that Revision D and E AMD Athlon 64 and AMD 
Opteron processors issue WC buffers in sequential address order: 

• When practical, shadow the data structure in memory (rather than writing the actual WC buffer in 
MMI/O space), prior to copying the structure to WC MMI/O space. This will also ensure that the 
write-combining buffers are not emptied prematurely by external events (such as a UC read— 
perhaps issued by another device driver thread or a hardware interrupt, etc.). Shadowing also 
ensures that writes that occur to different cache lines in the structure do not send out the WC 
buffers, since the number of WC buffers that can be open at one time is CPU implementation 
dependent. 

• When ready to update the actual WC MMI/O address space, copy the shadowed structure from
memory to MMI/O, from the lowest address 64-byte block upward. To do the copy, use discrete
loads and stores for up to 64 bytes of data. Use a loop of discrete loads and stores for up to 4KB of
data. For up to 32KB, use REP MOVS instructions. To do discrete loads, use assembly language or, if
available, compiler intrinsic functions (__movsb(), __movsw(), __movsd(), etc.).

• In general, using these methods to do the copy will exhibit less overhead in a data movement 
function than calling a memcpy() LIBC function, which is usually optimized for copying larger 
blocks of memory. 
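
The shadow-then-copy guideline above can be sketched in C as follows. This is a minimal sketch, not a definitive implementation: the function name and parameters are hypothetical, wc_dst is assumed to point to a region that the operating system or driver has already mapped as WC, and the structure size is assumed to be a multiple of 64 bytes. The volatile qualifier only keeps the compiler from reordering or merging the stores; how the processor combines them still depends on the events listed in Table 12.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copy a shadowed structure from ordinary memory to a WC-mapped
 * destination using a loop of discrete quadword loads and stores,
 * starting at the lowest address and moving upward so that 64-byte
 * blocks are filled in ascending address order. */
void copy_shadow_to_wc(volatile uint64_t *wc_dst, const uint64_t *shadow,
                       size_t bytes)
{
    assert(bytes % 64 == 0); /* assumption: whole 64-byte blocks only */

    size_t qwords = bytes / sizeof(uint64_t);
    for (size_t i = 0; i < qwords; i++)
        wc_dst[i] = shadow[i]; /* one discrete load and one discrete store */
}
```

A caller would fill the shadow copy in any convenient order and then make a single call such as copy_shadow_to_wc(mmio_window, shadow, sizeof shadow), where mmio_window is a hypothetical pointer obtained from the platform's mapping interface.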
Appendix C Instruction Latencies


This appendix provides a complete listing of all AMD64 instructions, along with their encodings, 
decode types, and execution latencies. For more information on these instructions, see volumes 3, 4, 
and 5 of the AMD64 Architecture Programmer's Manual (order# 24594, 26568, and 26569).

Note: Some prior AMD documents referred to one group of instructions as MMX™ technology 
extensions. Those instructions are still supported by the AMD Athlon™ 64 and 
AMD Opteron™ processors, but are documented with the SSE instructions in this guide. (The
MMX™ technology instructions remain a separate group.) 

The instruction entries in this appendix are grouped into categories as indicated in the following table 
and are presented within each category in alphabetical order by mnemonic: 


Topic                                Page

Understanding Instruction Entries    270
Integer Instructions                 273
MMX™ Technology Instructions         303
x87 Floating-Point Instructions      307
3DNow!™ Technology Instructions      314
3DNow!™ Technology Extensions        316
SSE Instructions                     317
SSE2 Instructions                    326
SSE3 Instructions                    342
C.1 Understanding Instruction Entries

To use the information in this appendix effectively, you need to understand how the entry for an 
instruction is organized and how to interpret certain items. 

Example: Instruction Entry 

The entry for an instruction begins with its syntax. Subsequent columns provide additional 
information about the instruction. 



                   Encoding
Syntax             First byte   Second byte   ModRM byte   Decode type   Latency   Note

ADD mreg8, reg8    00h                        11-xxx-xxx   DirectPath    1



Parts of the Instruction Entry 

This table describes the columns that are common to each instruction entry in this appendix. 


Column        Description

Syntax        Shows the syntax for the instruction—the permitted arrangement of its parts. Items in
              italics are placeholders for operands that you must provide. For information on how to
              interpret the placeholders, see “Interpreting Placeholders” on page 271.

Encoding      Shows how the assembler translates the instruction into machine language.
              Subcolumns show the individual bytes of the encoding.

Decode type   Shows the method that the processor uses to decode the instruction—either DirectPath
              Single (DirectPath), DirectPath Double (Double), or VectorPath.

Latency       Shows the static execution latency for the instruction. For details on how to interpret the
              latency information, see “Interpreting Latencies” on page 272.

Throughput    This value indicates the maximum theoretical rate of execution of that instruction. For
              example, a value of 1/2 means that one such instruction executes every two clocks, or
              two such instructions in four clocks and so on. A value of 3/1 indicates that three such
              instructions can be executed every clock, but fewer than three such instructions would
              still take one clock.
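
The throughput arithmetic can be checked with a short C helper. This is an illustrative sketch only: the function name min_clocks is hypothetical, and the calculation assumes identical, independent instructions limited solely by the stated throughput (it ignores latency, dependencies, and decode restrictions).

```c
#include <assert.h>

/* Minimum number of clocks needed to execute n identical, independent
 * instructions whose throughput is num/den instructions per clock
 * (for example, 1/2 or 3/1). Computes ceil(n * den / num). */
unsigned min_clocks(unsigned n, unsigned num, unsigned den)
{
    assert(num > 0 && den > 0);
    return (n * den + num - 1) / num;
}
```

With a throughput of 1/2, min_clocks(2, 1, 2) gives 4 clocks; with a throughput of 3/1, both min_clocks(1, 3, 1) and min_clocks(3, 3, 1) give 1 clock, matching the description in the Throughput entry.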


The entries for floating-point, MMX, SSE, SSE2, and 3DNow!™ instructions have an additional
column [FPU Pipe(s)] that lists the possible floating-point unit (FPU) pipelines available for use by
any particular DirectPath or Double decoded operation. For example, the floating-point multiplier is
represented by FMUL.
Interpreting Placeholders

The Syntax column for an instruction entry shows the mnemonic for the instruction followed by any 
operands. Items in italics are placeholders for operands that you must provide. A placeholder 
indicates the size and type of operand that is allowed. 


This operand         Is a placeholder for

disp8                A byte (8-bit) displacement value
disp16/32            A word (16-bit) or doubleword (32-bit) displacement value
disp32/48            A doubleword (32-bit) or 48-bit displacement value
imm8                 A byte (8-bit) immediate value
imm16                A word (16-bit) immediate value
imm32                A doubleword (32-bit) immediate value
mem8                 A byte (8-bit) memory location
mem16/32/64          A memory location that contains a word, doubleword, or quadword
mem16/32&mem16/32    A memory location that contains a pair of words or doublewords
mem32/48             A doubleword (32-bit) or 48-bit memory location
mem48                A 48-bit memory location
mem64                A quadword (64-bit) memory location
mem128               A double quadword (128-bit) memory location
mem32real            A memory location that contains a single-precision (32-bit) floating-point value
mem64real            A memory location that contains a double-precision (64-bit) floating-point value
mem80real            A memory location that contains a double-extended-precision (80-bit)
                     floating-point value
mmreg                An MMX™ register
mmreg1               An MMX register defined by bits 5, 4, and 3 of the ModRM byte
mmreg2               An MMX register defined by bits 2, 1, and 0 of the ModRM byte
mreg8                A byte general-purpose register defined by the r/m field (bits 2, 1, and 0)
                     of the ModRM byte
mreg16/32/64         A word, doubleword, or quadword general-purpose register defined by the
                     r/m field (bits 2, 1, and 0) of the ModRM byte
reg8                 A byte general-purpose register defined by instruction byte(s) or the reg
                     field (bits 5, 4, and 3) of the ModRM byte
reg16/32/64          A word, doubleword, or quadword general-purpose register defined by
                     instruction byte(s) or the reg field (bits 5, 4, and 3) of the ModRM byte
sreg                 A segment register (always 16 bits wide)
xmmreg               An XMM register
xmmreg1              An XMM register defined by bits 5, 4, and 3 of the ModRM byte
xmmreg2              An XMM register defined by bits 2, 1, and 0 of the ModRM byte
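
The bit positions named in the table can be extracted from a ModRM byte as shown in the following C sketch. The type and function names are hypothetical and chosen for illustration; the field layout (mod in bits 7 and 6, reg in bits 5, 4, and 3, r/m in bits 2, 1, and 0) is the standard ModRM layout, and a mod value of 11b corresponds to the register forms written as 11-xxx-xxx in the instruction tables.

```c
#include <stdint.h>

/* Fields of a ModRM byte, written mm-rrr-xxx in the instruction tables. */
typedef struct {
    unsigned mod; /* bits 7-6: 3 (11b) selects a register operand */
    unsigned reg; /* bits 5-3: reg field or opcode extension */
    unsigned rm;  /* bits 2-0: r/m field */
} modrm_fields;

/* Split a ModRM byte into its three fields. */
modrm_fields decode_modrm(uint8_t modrm)
{
    modrm_fields f;

    f.mod = (modrm >> 6) & 0x3u;
    f.reg = (modrm >> 3) & 0x7u;
    f.rm = modrm & 0x7u;
    return f;
}
```

For example, the byte D3h is 11-010-011 in binary, so decode_modrm(0xD3) yields mod = 3, reg = 2, and rm = 3.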
Interpreting Latencies

The Latency column for an instruction entry shows the static execution latency for the instruction. 
The static execution latency is the number of clock cycles it takes to execute the serially dependent 
sequence of micro-ops that comprise the instruction. 

The latencies in this appendix are estimates and are subject to change. They assume that: 

• The instruction is an L1-cache hit that has already been fetched and decoded, with the operations
loaded into the scheduler.

• Memory operands are in the L1 data cache.

• There is no contention for execution resources or load-store unit resources. 

The following formats are used to indicate the static execution latency: 


Latency format   Description                                                        Example

x                The latency is the indicated value.                                3

x-y              The latency is a value greater than or equal to x and less than    31-73
                 or equal to y.

x/y/z            The latency differs according to the size of the operands. The     26/42/74
                 values x, y, and z are the 16-, 32-, and 64-bit latencies,
                 respectively.

x(y)             The latency depends on whether an error condition exists. When     68 (108)
                 there is no error condition, x is the latency. When an error
                 condition exists, y is the latency.

~                The latency is unavailable.
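
For tools that read these tables, the latency formats can be recognized with a small parser. The following C sketch is illustrative only: the type names, enumerators, and the function parse_latency are hypothetical, and the parser accepts only the formats listed above (with "~" as the unavailable marker).

```c
#include <stdio.h>

typedef enum {
    LAT_INVALID,         /* not one of the documented formats */
    LAT_FIXED,           /* x        */
    LAT_RANGE,           /* x-y      */
    LAT_BY_SIZE,         /* x/y/z    */
    LAT_ERROR_DEPENDENT, /* x(y)     */
    LAT_UNAVAILABLE      /* ~        */
} lat_kind;

typedef struct {
    lat_kind kind;
    int v[3]; /* x, y, z as applicable; unspecified when kind is LAT_INVALID */
} latency;

/* True if sscanf reached %n and consumed the whole string. */
static int consumed_all(const char *s, int n)
{
    return n >= 0 && s[n] == '\0';
}

latency parse_latency(const char *s)
{
    latency r = { LAT_INVALID, { 0, 0, 0 } };
    int n;

    if (s[0] == '~' && s[1] == '\0') {
        r.kind = LAT_UNAVAILABLE;
        return r;
    }
    n = -1;
    if (sscanf(s, "%d/%d/%d%n", &r.v[0], &r.v[1], &r.v[2], &n) == 3 && consumed_all(s, n)) {
        r.kind = LAT_BY_SIZE;
        return r;
    }
    n = -1;
    if (sscanf(s, "%d-%d%n", &r.v[0], &r.v[1], &n) == 2 && consumed_all(s, n)) {
        r.kind = LAT_RANGE;
        return r;
    }
    n = -1;
    if (sscanf(s, "%d (%d)%n", &r.v[0], &r.v[1], &n) == 2 && consumed_all(s, n)) {
        r.kind = LAT_ERROR_DEPENDENT;
        return r;
    }
    n = -1;
    if (sscanf(s, "%d%n", &r.v[0], &n) == 1 && consumed_all(s, n))
        r.kind = LAT_FIXED;
    return r;
}
```

Given the example strings from the table, parse_latency("26/42/74") reports the by-size format with the 16-, 32-, and 64-bit latencies in v[0], v[1], and v[2].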
C.2 Integer Instructions


Table 13. Integer Instructions 


Syntax 

Encoding 

Decode 

type 

Latency 

Note 

First 

byte 

Second 

byte 

ModRM 

byte 

AAA 

37h 



VectorPath 

5 


AAD (or directly coded D5 ib, where ib is a byte 
value other than OAh) 

D5h 

OAh 


VectorPath 

5 


AAM (or directly coded D4 ib, where ib is a 
byte value other than OAh) 

D4h 

OAh 


VectorPath 

15 


AAS 

3Fh 



VectorPath 

5 


ADC mreg8, reg8 

lOh 


11 -xxx-xxx 

DirectPath 

1 


ADC mem8, reg8 

lOh 


mm-xxx-xxx 

DirectPath 

4 


ADC mregl6/32/64, reg16/32/64 

11 h 


11 -xxx-xxx 

DirectPath 

1 


ADC mem 16/32/64, reg16/32/64 

11 h 


mm-xxx-xxx 

DirectPath 

4 


ADC reg8, mreg8 

12h 


11 -xxx-xxx 

DirectPath 

1 


ADC reg8, mem8 

12h 


mm-xxx-xxx 

DirectPath 

4 


ADC reg 16/32/64, mreg16/32/64 

13h 


11 -xxx-xxx 

DirectPath 

1 


ADC regl6/32/64, mem16/32/64 

13h 


mm-xxx-xxx 

DirectPath 

4 


ADC AL, imm8 

14h 



DirectPath 

1 


ADC AX, imm16 

15h 



DirectPath 

1 


ADC EAX, imm32 

15h 



DirectPath 

1 


ADC RAX, imm32 (sign extended) 

15h 



DirectPath 

1 


ADC mreg8, imm8 

80h 


11-010-xxx 

DirectPath 

1 


ADC mem8, imm8 

80h 


mm-010-xxx 

DirectPath 

4 


ADC mregl6/32/64, imm16/32 

81 h 


11-010-xxx 

DirectPath 

1 


ADC mem 16/32/64, imm16/32 

81 h 


mm-010-xxx 

DirectPath 

4 



Notes: 

1. Static timing assumes a predicted branch. 

2. Store operation also updates ESP—the new register value is available one clock earlier than the specified 
latency. 

3. The clock count, regardless of the number of shifts or rotates, as determined by CL or imm8. 

4. LEA instructions have a latency of 1 when there are two source operands (as in the case of the base + index 
form LEA EAX, [EDX+EDI]). Forms with a scale or more than two source operands will have a latency of 2 (LEA 
EAX, [EBX+EBX*8]). 

5. These instructions have an effective latency as shown. They map to internal NOPs that can be issued at a rate of 
three per cycle but do not occupy execution resources. 

6. The latency of repeated string instructions can be found in “Latency of Repeated String Instructions” on 
page 167. 

7. The first latency value is for 32-bit mode. The second is for 64-bit mode. 

8. This opcode is used as a REX prefix in 64-bit mode. 
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Table 13. Integer Instructions (Continued) 


Syntax 

Encoding 

Decode 

type 

Latency 

Note 

First 

byte 

Second 

byte 

ModRM 

byte 

ADC mreg16/32/64, imm8 (sign extended) 

83h 


11-010-xxx 

DirectPath 

1 


ADC mem 16/32/64, imm8 (sign extended) 

83h 


mm-010-xxx 

DirectPath 

4 


ADD mreg8, reg8 

OOh 


11 -xxx-xxx 

DirectPath 

1 


ADD mem8, reg8 

OOh 


mm-xxx-xxx 

DirectPath 

4 


ADD mregl6/32/64, reg16/32/64 

Olh 


11 -xxx-xxx 

DirectPath 

1 


ADD mem 16/32/64, reg16/32/64 

Olh 


mm-xxx-xxx 

DirectPath 

4 


ADD reg8, mreg8 

02h 


11 -xxx-xxx 

DirectPath 

1 


ADD reg8, mem8 

02h 


mm-xxx-xxx 

DirectPath 

4 


ADD reg 16/32/64, mreg16/32/64 

03h 


11 -xxx-xxx 

DirectPath 

1 


ADD regl6/32/64, mem 16/32/64 

03h 


mm-xxx-xxx 

DirectPath 

4 


ADD AL, imm8 

04h 



DirectPath 

1 


ADD AX, imm16 

05h 



DirectPath 

1 


ADD EAX, imm32 

05h 



DirectPath 

1 


ADD RAX, imm32 (sign extended) 

05h 



DirectPath 

1 


ADD mreg8, imm8 

80h 


11 -000-xxx 

DirectPath 

1 


ADD mem8, imm8 

80h 


mm-000-xxx 

DirectPath 

4 


ADD mregl6/32/64, imm16/32 

81 h 


11 -000-xxx 

DirectPath 

1 


ADD mem 16/32/64, imm16/32 

81 h 


mm-000-xxx 

DirectPath 

4 


ADD mreg16/32/64, imm8 (sign extended) 

83h 


11 -000-xxx 

DirectPath 

1 


ADD mem16/32/64, imm8 (sign extended) 

83h 


mm-000-xxx 

DirectPath 

4 


AND mreg8, reg8 

20h 


11 -xxx-xxx 

DirectPath 

1 


AND mem8, reg8 

20h 


mm-xxx-xxx 

DirectPath 

4 


AND mregl6/32/64, regl6/32/64 

21 h 


11 -xxx-xxx 

DirectPath 

1 


AND mem 16/32/64, regl6/32/64 

21 h 


mm-xxx-xxx 

DirectPath 

4 



Notes: 

1. Static timing assumes a predicted branch. 

2. Store operation also updates ESP—the new register value is available one clock earlier than the specified 
latency. 

3. The clock count, regardless of the number of shifts or rotates, as determined by CL or imm8. 

4. LEA instructions have a latency of 1 when there are two source operands (as in the case of the base + index 
form LEA EAX, [EDX+EDI]). Forms with a scale or more than two source operands will have a latency of 2 (LEA 
EAX, [EBX+EBX*8]). 

5. These instructions have an effective latency as shown. They map to internal NOPs that can be issued at a rate of 
three per cycle but do not occupy execution resources. 

6. The latency of repeated string instructions can be found in “Latency of Repeated String Instructions” on 
page 167. 

7. The first latency value is for 32-bit mode. The second is for 64-bit mode. 

8. This opcode is used as a REX prefix in 64-bit mode. 
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Table 13. Integer Instructions (Continued) 


Syntax 

Encoding 

Decode 

type 

Latency 

Note 

First 

byte 

Second 

byte 

ModRM 

byte 

AND reg8, mreg8 

22h 


11 -xxx-xxx 

DirectPath 

1 


AND reg8, mem8 

22h 


mm-xxx-xxx 

DirectPath 

4 


AND reg 16/32/64, mreg16/32/64 

23h 


11 -xxx-xxx 

DirectPath 

1 


AND reg 16/32/64, mem16/32/64 

23h 


mm-xxx-xxx 

DirectPath 

4 


AND AL, imm8 

24h 



DirectPath 

1 


AND AX, imm16 

25h 



DirectPath 

1 


AND EAX, imm32 

25h 



DirectPath 

1 


AND RAX, imm32 (sign extended) 

25h 



DirectPath 

1 


AND mreg8, imm8 

80h 


11-100-xxx 

DirectPath 

1 


AND mem8, imm8 

80h 


mm-100-xxx 

DirectPath 

4 


AND mreg 16/32/64, imm16/32 

81 h 


11-100-xxx 

DirectPath 

1 


AND mem 16/32/64, imm16/32 

81 h 


mm-100-xxx 

DirectPath 

4 


AND mregl6/32/64, imm8 (sign extended) 

83h 


11-100-xxx 

DirectPath 

1 


AND mem 16/32/64, imm8 (sign extended) 

83h 


mm-100-xxx 

DirectPath 

4 


ARPL mreg16, reg16 

63h 


11 -xxx-xxx 

VectorPath 

13 


ARPL mem16, reg16 

63h 


mm-xxx-xxx 

VectorPath 

18 


BOUND reg16/32, mem 16/32&mem16/32 

62h 


mm-xxx-xxx 

VectorPath 

6 


BSF reg 16/32/64, mreg16/32/64 

OFh 

BCh 

11 -xxx-xxx 

VectorPath 

8/8/9 


BSF reg 16/32/64, mem 16/32/64 

OFh 

BCh 

mm-xxx-xxx 

VectorPath 

10/11/ 

12 


BSR reg 16/32/64, mregl6/32/64 

OFh 

BDh 

11 -xxx-xxx 

VectorPath 

11 


BSR reg 16/32/64, mem 16/32/64 

OFh 

BDh 

mm-xxx-xxx 

VectorPath 

14/13/ 

13 


BSWAP EAX/RAX/R8 

OFh 

C8h 


DirectPath 

1 



Notes: 

1. Static timing assumes a predicted branch. 

2. Store operation also updates ESP—the new register value is available one clock earlier than the specified 
latency. 

3. The clock count, regardless of the number of shifts or rotates, as determined by CL or imm8. 

4. LEA instructions have a latency of 1 when there are two source operands (as in the case of the base + index 
form LEA EAX, [EDX+EDI]). Forms with a scale or more than two source operands will have a latency of 2 (LEA 
EAX, [EBX+EBX*8]). 

5. These instructions have an effective latency as shown. They map to internal NOPs that can be issued at a rate of 
three per cycle but do not occupy execution resources. 

6. The latency of repeated string instructions can be found in “Latency of Repeated String Instructions” on 
page 167. 

7. The first latency value is for 32-bit mode. The second is for 64-bit mode. 

8. This opcode is used as a REX prefix in 64-bit mode. 
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Table 13. Integer Instructions (Continued) 


Syntax 

Encoding 

Decode 

type 

Latency 

Note 

First 

byte 

Second 

byte 

ModRM 

byte 

BSWAP EBP/RBP/R13 

OFh 

CDh 


DirectPath 

1 


BSWAP EBX/RBX/R11 

OFh 

CBh 


DirectPath 

1 


BSWAP ECX/RCX/R9 

OFh 

C9h 


DirectPath 

1 


BSWAP EDI/RDI/R15 

OFh 

CFh 


DirectPath 

1 


BSWAP EDX/RDX/R10 

OFh 

CAh 


DirectPath 

1 


BSWAP ESI/RSI/R14 

OFh 

CEh 


DirectPath 

1 


BSWAP ESP/RSP/R12 

OFh 

CCh 


DirectPath 

1 


BT mreg 16/32/64, reg 16/32/64 

OFh 

A3h 

11 -xxx-xxx 

DirectPath 

1 


BT mem 16/32/64, reg 16/32/64 

OFh 

A3h 

mm-xxx-xxx 

VectorPath 

8 


BT mregl6/32/64, imm8 

OFh 

BAh 

11-100-xxx 

DirectPath 

1 


BT mem 16/32/64, imm8 

OFh 

BAh 

mm-100-xxx 

DirectPath 

4 


BTC mregl6/32/64, reg 16/32/64 

OFh 

BBh 

11 -xxx-xxx 

Double 

2 


BTC mem 16/32/64, reg 16/32/64 

OFh 

BBh 

mm-xxx-xxx 

VectorPath 

9 


BTC mregl6/32/64, imm8 

OFh 

BAh 

11-111 -XXX 

Double 

2 


BTC mem 16/32/64, imm8 

OFh 

BAh 

mm-111-xxx 

VectorPath 

5 


BTR mregl6/32/64, reg 16/32/64 

OFh 

B3h 

11 -xxx-xxx 

Double 

2 


BTR mem 16/32/64, reg 16/32/64 

OFh 

B3h 

mm-xxx-xxx 

VectorPath 

9 


BTR mregl6/32/64, imm8 

OFh 

BAh 

11-110-xxx 

Double 

2 


BTR mem 16/32/64, imm8 

OFh 

BAh 

mm-110-xxx 

VectorPath 

5 


BTS mregl6/32/64, reg 16/32/64 

OFh 

ABh 

11 -xxx-xxx 

Double 

2 


BTS mem 16/32/64, reg 16/32/64 

OFh 

ABh 

mm-xxx-xxx 

VectorPath 

9 


BTS mregl6/32/64, imm8 

OFh 

BAh 

11-101-xxx 

Double 

2 


BTS mem 16/32/64, imm8 

OFh 

BAh 

mm-101-xxx 

VectorPath 

5 


CALL displ6/32 (near, displacement) 

E8h 



VectorPath 

3 

2 


Notes: 

1. Static timing assumes a predicted branch. 

2. Store operation also updates ESP—the new register value is available one clock earlier than the specified 
latency. 

3. The clock count, regardless of the number of shifts or rotates, as determined by CL or imm8. 

4. LEA instructions have a latency of 1 when there are two source operands (as in the case of the base + index 
form LEA EAX, [EDX+EDI]). Forms with a scale or more than two source operands will have a latency of 2 (LEA 
EAX, [EBX+EBX*8]). 

5. These instructions have an effective latency as shown. They map to internal NOPs that can be issued at a rate of 
three per cycle but do not occupy execution resources. 

6. The latency of repeated string instructions can be found in “Latency of Repeated String Instructions” on 
page 167. 

7. The first latency value is for 32-bit mode. The second is for 64-bit mode. 

8. This opcode is used as a REX prefix in 64-bit mode. 
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Table 13. Integer Instructions (Continued) 


Syntax 

Encoding 

Decode 

type 

Latency 

Note 

First 

byte 

Second 

byte 

ModRM 

byte 

CALL mem 16/32/64 (near, indirect) 

FFh 


mm-010-xxx 

VectorPath 

4 


CALL mreg16/32/64 (near, indirect) 

FFh 


11-010-xxx 

VectorPath 

4 


CALL mem16:16/32 (far, indirect) 

FFh 


11-Oil-xxx 

VectorPath 

- 


CALL pntr16:16/32 ( far, direct, no CPL 
change) 

9Ah 



VectorPath 

33 


CALL pntr16:16/32 (far, direct, CPL change) 

9Ah 



VectorPath 

150 


CBW/CWDE/CDQE 

98h 



DirectPath 

1 


CLC 

F8h 



DirectPath 

1 


CLD 

FCh 



DirectPath 

1 


CLFLUSH 

OFh 

AEh 

mm-111-xx 

DirectPath 

- 


CLI 

FAh 



VectorPath 

4 


CLTS 

OFh 

06h 


VectorPath 

10 


CMC 

F5h 



DirectPath 

1 


CMOVA/CMOVNBE reg16/32/64, 
mem 16/32/64 

OFh 

47h 

mm-xxx-xxx 

DirectPath 

4 


CMOVA/CMOVNBE reg16/32/64, reg16/32/64 

OFh 

47h 

11 -xxx-xxx 

DirectPath 

1 


CMOVAE/CMOVNB/CMOVNC reg16/32/64, 
mem 16/32/64 

OFh 

43h 

mm-xxx-xxx 

DirectPath 

4 


CMOVAE/CMOVNB/CMOVNC reg16/32/64, 
reg 16/32/64 

OFh 

43h 

11 -xxx-xxx 

DirectPath 

1 


CMOVB/CMOVC/CMOVNAE reg16/32/64, 
mem 16/32/64 

OFh 

42h 

mm-xxx-xxx 

DirectPath 

4 


CMOVB/CMOVC/CMOVNAE reg16/32/64, 
reg 16/32/64 

OFh 

42h 

11 -xxx-xxx 

DirectPath 

1 



Notes: 

1. Static timing assumes a predicted branch. 

2. Store operation also updates ESP—the new register value is available one clock earlier than the specified 
latency. 

3. The clock count, regardless of the number of shifts or rotates, as determined by CL or imm8. 

4. LEA instructions have a latency of 1 when there are two source operands (as in the case of the base + index 
form LEA EAX, [EDX+EDI]). Forms with a scale or more than two source operands will have a latency of 2 (LEA 
EAX, [EBX+EBX*8]). 

5. These instructions have an effective latency as shown. They map to internal NOPs that can be issued at a rate of 
three per cycle but do not occupy execution resources. 

6. The latency of repeated string instructions can be found in “Latency of Repeated String Instructions” on 
page 167. 

7. The first latency value is for 32-bit mode. The second is for 64-bit mode. 

8. This opcode is used as a REX prefix in 64-bit mode. 
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Table 13. Integer Instructions (Continued) 


Syntax 

Encoding 

Decode 

type 

Latency 

Note 

First 

byte 

Second 

byte 

ModRM 

byte 

CMOVBE/CMOVNA reg16/32/64, 
mem 16/32/64 

OFh 

46h 

mm-xxx-xxx 

DirectPath 

4 


CMOVBE/CMOVNA reg16/32/64, reg16/32/64 

OFh 

46h 

11 -xxx-xxx 

DirectPath 

1 


CMOVE/CMOVZ reg 16/32/64, mem16/32/64 

OFh 

44h 

mm-xxx-xxx 

DirectPath 

4 


CMOVE/CMOVZ reg 16/32/64, reg16/32/64 

OFh 

44h 

11 -xxx-xxx 

DirectPath 

1 


CMOVG/CMOVNLE regl6/32/64, 
mem 16/32/64 

OFh 

4Fh 

mm-xxx-xxx 

DirectPath 

4 


CMOVG/CMOVNLE reg16/32/64, reg16/32/64 

OFh 

4Fh 

11 -xxx-xxx 

DirectPath 

1 


CMOVGE/CMOVNL reg16/32/64, 
mem 16/32/64 

OFh 

4Dh 

mm-xxx-xxx 

DirectPath 

4 


CMOVGE/CMOVNL reg16/32/64, reg16/32/64 

OFh 

4Dh 

11 -xxx-xxx 

DirectPath 

1 


CMOVL/CMOVNGE reg16/32/64, 
mem 16/32/64 

OFh 

4Ch 

mm-xxx-xxx 

DirectPath 

4 


CMOVL/CMOVNGE reg16/32/64, reg16/32/64 

OFh 

4Ch 

11 -xxx-xxx 

DirectPath 

1 


CMOVLE/CMOVNG reg16/32/64, 
mem 16/32/64 

OFh 

4Eh 

mm-xxx-xxx 

DirectPath 

4 


CMOVLE/CMOVNG reg16/32/64, reg16/32/64 

OFh 

4Eh 

11 -xxx-xxx 

DirectPath 

1 


CMOVNE/CMOVNZ reg16/32/64, 
mem 16/32/64 

OFh 

45h 

mm-xxx-xxx 

DirectPath 

4 


CMOVNE/CMOVNZ reg16/32/64, reg16/32/64 

OFh 

45h 

11 -xxx-xxx 

DirectPath 

1 


CMOVNO regl6/32/64, meml6/32/64 

OFh 

41 h 

mm-xxx-xxx 

DirectPath 

4 


CMOVNO regl6/32/64, reg16/32/64 

OFh 

41 h 

11 -xxx-xxx 

DirectPath 

1 


CMOVNP/CMOVPO regl6/32/64, 
meml 6/32/64 

OFh 

4Bh 

mm-xxx-xxx 

DirectPath 

4 


CMOVNP/CMOVPO regl6/32/64, reg16/32/64 

OFh 

4Bh 

11 -xxx-xxx 

DirectPath 

1 



Notes: 

1. Static timing assumes a predicted branch. 

2. Store operation also updates ESP—the new register value is available one clock earlier than the specified 
latency. 

3. The clock count, regardless of the number of shifts or rotates, as determined by CL or imm8. 

4. LEA instructions have a latency of 1 when there are two source operands (as in the case of the base + index 
form LEA EAX, [EDX+EDI]). Forms with a scale or more than two source operands will have a latency of 2 (LEA 
EAX, [EBX+EBX‘8]). 

5. These instructions have an effective latency as shown. They map to internal NOPs that can be issued at a rate of 
three per cycle but do not occupy execution resources. 

6. The latency of repeated string instructions can be found in “Latency of Repeated String Instructions” on 
page 167. 

7. The first latency value is for 32-bit mode. The second is for 64-bit mode. 

8. This opcode is used as a REX prefix in 64-bit mode. 
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Table 13. Integer Instructions (Continued) 


Syntax 

Encoding 

Decode 

type 

Latency 

Note 

First 

byte 

Second 

byte 

ModRM 

byte 

CMOVNS reg 16/32/64, mem16/32/64 

OFh 

49h 

mm-xxx-xxx 

DirectPath 

4 


CMOVNS reg 16/32/64, reg16/32/64 

OFh 

49h 

11 -xxx-xxx 

DirectPath 

1 


CMOVO regl6/32/64, meml6/32/64 

OFh 

40h 

mm-xxx-xxx 

DirectPath 

4 


CMOVO regl6/32/64, reg16/32/64 

OFh 

40h 

11 -xxx-xxx 

DirectPath 

1 


CMOVP/CMOVPE regl6/32/64, meml6/32/64 

OFh 

4Ah 

mm-xxx-xxx 

DirectPath 

4 


CMOVP/CMOVPE regl6/32/64, reg16/32/64 

OFh 

4Ah 

11 -xxx-xxx 

DirectPath 

1 


CMOVS regl6/32/64, meml6/32/64 

OFh 

48h 

mm-xxx-xxx 

DirectPath 

4 


CMOVS regl6/32/64, reg16/32/64 

OFh 

48h 

11 -xxx-xxx 

DirectPath 

1 


CMP mem8, reg8 

38h 


mm-xxx-xxx 

DirectPath 

4 


CMP mreg8, reg8 

38h 


11 -xxx-xxx 

DirectPath 

1 


CMP mem 16/32/64, reg16/32/64 

39h 


mm-xxx-xxx 

DirectPath 

4 


CMP mregl6/32/64, reg16/32/64 

39h 


11 -xxx-xxx 

DirectPath 

1 


CMP reg8, mem8 

3Ah 


mm-xxx-xxx 

DirectPath 

4 


CMP reg8, mreg8 

3Ah 


11 -xxx-xxx 

DirectPath 

1 


CMP regl6/32/64, meml6/32/64 

3Bh 


mm-xxx-xxx 

DirectPath 

4 


CMP regl6/32/64, mreg16/32/64 

3Bh 


11 -xxx-xxx 

DirectPath 

1 


CMP AL, imm8 

3Ch 



DirectPath 

1 


CMP AX/EAX, imml6/32 

3Dh 



DirectPath 

1 


CMP RAX, / mm32 (sign extended) 

3Dh 



DirectPath 

1 


CMP mem8, imm8 

80h 


mm-111-xxx 

DirectPath 

4 


CMP mreg8, imm8 

80h 


11-111 -XXX 

DirectPath 

1 


CMP mem 16/32/64, imm16/32 

81 h 


mm-111-xxx 

DirectPath 

4 


CMP mregl6/32/64, im ml 6/32 

81 h 


11-111-xxx 

DirectPath 

1 


CMP mem16/32/64, imm8 (sign extended) 

83h 


mm-111-xxx 

DirectPath 

4 



Notes: 

1. Static timing assumes a predicted branch. 

2. Store operation also updates ESP—the new register value is available one clock earlier than the specified 
latency. 

3. The clock count, regardless of the number of shifts or rotates, as determined by CL or imm8. 

4. LEA instructions have a latency of 1 when there are two source operands (as in the case of the base + index 
form LEA EAX, [EDX+EDI]). Forms with a scale or more than two source operands will have a latency of 2 (LEA 
EAX, [EBX+EBX*8]). 

5. These instructions have an effective latency as shown. They map to internal NOPs that can be issued at a rate of 
three per cycle but do not occupy execution resources. 

6. The latency of repeated string instructions can be found in “Latency of Repeated String Instructions” on 
page 167. 

7. The first latency value is for 32-bit mode. The second is for 64-bit mode. 

8. This opcode is used as a REX prefix in 64-bit mode. 
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Table 13. Integer Instructions (Continued) 


Syntax | Encoding, first byte | Encoding, second byte | Encoding, ModRM byte | Decode type | Latency | Note
CMP mreg16/32/64, imm8 (sign extended) | 83h | | 11-111-xxx | DirectPath | 1 |
CMPS mem8, mem8 | A6h | | | VectorPath | 6 | 6
CMPS mem16/32/64, mem16/32/64 | A7h | | | VectorPath | 6 | 6
CMPSB | A6h | | | VectorPath | 6 | 6
CMPSD | A7h | | | VectorPath | 6 | 6
CMPSQ | A7h | | | VectorPath | 6 | 7
CMPSW | A7h | | | VectorPath | 6 | 6
CMPXCHG mem8, reg8 | 0Fh | B0h | mm-xxx-xxx | VectorPath | 5 |
CMPXCHG mreg8, reg8 | 0Fh | B0h | 11-xxx-xxx | VectorPath | 3 |
CMPXCHG mem16/32/64, reg16/32/64 | 0Fh | B1h | mm-xxx-xxx | VectorPath | 5 |
CMPXCHG mreg16/32/64, reg16/32/64 | 0Fh | B1h | 11-xxx-xxx | VectorPath | 3 |
CMPXCHG8B mem64 | 0Fh | C7h | mm-xxx-xxx | VectorPath | 10 |
CMPXCHG16B mem128 | 0Fh | C7h | mm-xxx-xxx | VectorPath | |
CPUID (function 0) | 0Fh | A2h | | VectorPath | 36 |
CPUID (function 1) | 0Fh | A2h | | VectorPath | 152 |
CPUID (function 2) | 0Fh | A2h | | VectorPath | 38 |
CPUID (function 8000_0001h) | 0Fh | A2h | | VectorPath | |
CPUID (function 8000_0002h) | 0Fh | A2h | | VectorPath | |
CPUID (function 8000_0003h) | 0Fh | A2h | | VectorPath | |
CPUID (function 8000_0004h) | 0Fh | A2h | | VectorPath | |
CPUID (function 8000_0007h) | 0Fh | A2h | | VectorPath | |
CPUID (function 8000_0008h) | 0Fh | A2h | | VectorPath | |
CWD/CDQ/CQO | 99h | | | DirectPath | 1 |
DAA | 27h | | | VectorPath | 7 |



Table 13. Integer Instructions (Continued) 


Syntax | Encoding, first byte | Encoding, second byte | Encoding, ModRM byte | Decode type | Latency | Note
DAS | 2Fh | | | VectorPath | 7 |
DEC AX/EAX | 48h | | | DirectPath | 1 | 8
DEC BP/EBP | 4Dh | | | DirectPath | 1 | 8
DEC BX/EBX | 4Bh | | | DirectPath | 1 | 8
DEC CX/ECX | 49h | | | DirectPath | 1 | 8
DEC DI/EDI | 4Fh | | | DirectPath | 1 | 8
DEC DX/EDX | 4Ah | | | DirectPath | 1 | 8
DEC SI/ESI | 4Eh | | | DirectPath | 1 | 8
DEC SP/ESP | 4Ch | | | DirectPath | 1 | 8
DEC mem8 | FEh | | mm-001-xxx | DirectPath | 4 |
DEC mreg8 | FEh | | 11-001-xxx | DirectPath | 1 |
DEC mem16/32/64 | FFh | | mm-001-xxx | DirectPath | 4 |
DEC mreg16/32/64 | FFh | | 11-001-xxx | DirectPath | 1 |
DIV mem8 | F6h | | mm-110-xxx | VectorPath | 16 |
DIV mreg8 | F6h | | 11-110-xxx | VectorPath | 16 |
DIV mem16/32/64 | F7h | | mm-110-xxx | VectorPath | 23/39/71 |
DIV mreg16/32/64 | F7h | | 11-110-xxx | VectorPath | 23/39/71 |
ENTER | C8h | | | VectorPath | 14/17/19/21 | 5
IDIV mreg8 | F6h | | 11-111-xxx | VectorPath | 18 |
IDIV mem8 | F6h | | mm-111-xxx | VectorPath | 19 |
IDIV mreg16/32/64 | F7h | | 11-111-xxx | VectorPath | 26/42/74 |



Table 13. Integer Instructions (Continued) 


Syntax | Encoding, first byte | Encoding, second byte | Encoding, ModRM byte | Decode type | Latency | Note
IDIV mem16/32/64 | F7h | | mm-111-xxx | VectorPath | 27/43/75 |
IMUL reg16, imm16 | 69h | | 11-xxx-xxx | VectorPath | 4 |
IMUL reg32/64, imm32/(32 sign extended) | 69h | | 11-xxx-xxx | DirectPath | 3/4 |
IMUL reg16, mreg16, imm16 | 69h | | 11-xxx-xxx | VectorPath | 4 |
IMUL reg32/64, mreg32/64, imm32/(32 sign extended) | 69h | | 11-xxx-xxx | DirectPath | 3/4 |
IMUL reg16/32/64, mem16/32/64, imm16/32/(32 sign extended) | 69h | | mm-xxx-xxx | VectorPath | 7/7/8 |
IMUL reg16/32/64, imm8 (sign extended) | 6Bh | | 11-xxx-xxx | VectorPath | 4/3/4 |
IMUL reg16/32/64, mreg16/32/64, imm8 (signed) | 6Bh | | 11-xxx-xxx | VectorPath | 4/3/4 |
IMUL reg16/32/64, mem16/32/64, imm8 (signed) | 6Bh | | mm-xxx-xxx | VectorPath | 7/7/8 |
IMUL mreg8 | F6h | | 11-101-xxx | DirectPath | 3 |
IMUL mem8 | F6h | | mm-101-xxx | DirectPath | 6 |
IMUL mreg16 | F7h | | 11-101-xxx | VectorPath | 4 |
IMUL mreg32/64 | F7h | | 11-101-xxx | Double | 3/5 |
IMUL mem16 | F7h | | mm-101-xxx | VectorPath | 7 |
IMUL mem32/64 | F7h | | mm-101-xxx | Double | 6/8 |
IMUL reg16/32/64, mreg16/32/64 | 0Fh | AFh | 11-xxx-xxx | DirectPath | 3/3/4 |
IMUL reg16/32/64, mem16/32/64 | 0Fh | AFh | mm-xxx-xxx | DirectPath | 6/6/7 |
IN AL, imm8 | E4h | | | VectorPath | 184 |
IN AX, imm8 | E5h | | | VectorPath | 184 |
IN EAX, imm8 | E5h | | | VectorPath | 184 |



Table 13. Integer Instructions (Continued) 


Syntax | Encoding, first byte | Encoding, second byte | Encoding, ModRM byte | Decode type | Latency | Note
IN AL, DX | ECh | | | VectorPath | 179 |
IN AX, DX | EDh | | | VectorPath | 179 |
IN EAX, DX | EDh | | | VectorPath | 181 |
INC AX, EAX | 40h | | | DirectPath | 1 | 8
INC CX, ECX | 41h | | | DirectPath | 1 | 8
INC DX, EDX | 42h | | | DirectPath | 1 | 8
INC BX, EBX | 43h | | | DirectPath | 1 | 8
INC SP, ESP | 44h | | | DirectPath | 1 | 8
INC BP, EBP | 45h | | | DirectPath | 1 | 8
INC SI, ESI | 46h | | | DirectPath | 1 | 8
INC DI, EDI | 47h | | | DirectPath | 1 | 8
INC mreg8 | FEh | | 11-000-xxx | DirectPath | 1 |
INC mem8 | FEh | | mm-000-xxx | DirectPath | 4 |
INC mreg16/32/64 | FFh | | 11-000-xxx | DirectPath | 1 |
INC mem16/32/64 | FFh | | mm-000-xxx | DirectPath | 4 |
INSB/INS mem8, DX | 6Ch | | | VectorPath | 184 |
INSD/INS mem32, DX | 6Dh | | | VectorPath | 185 |
INSW/INS mem16, DX | 6Dh | | | VectorPath | 186 |
INT imm8 (no CPL change) | CDh | | | VectorPath | 87-109 |
INT imm8 (CPL change) | CDh | | | VectorPath | 91-112 |
INVD | 0Fh | 08h | | VectorPath | 247 |
INVLPG | 0Fh | 01h | mm-111-xxx | VectorPath | 101/80 | 7
IRET, IRETD, IRETQ (from 64-bit to 64-bit) | CFh | | | VectorPath | 91 |
IRET, IRETD, IRETQ (from 64-bit to 32-bit) | CFh | | | VectorPath | 111 |



Table 13. Integer Instructions (Continued) 


Syntax | Encoding, first byte | Encoding, second byte | Encoding, ModRM byte | Decode type | Latency | Note
JA/JNBE disp8 | 77h | | | DirectPath | 1 | 1
JA/JNBE disp16/32 | 0Fh | 87h | | DirectPath | 1 | 1
JAE/JNB/JNC disp8 | 73h | | | DirectPath | 1 | 1
JAE/JNB/JNC disp16/32 | 0Fh | 83h | | DirectPath | 1 | 1
JB/JC/JNAE disp8 | 72h | | | DirectPath | 1 | 1
JB/JC/JNAE disp16/32 | 0Fh | 82h | | DirectPath | 1 | 1
JBE/JNA disp8 | 76h | | | DirectPath | 1 | 1
JBE/JNA disp16/32 | 0Fh | 86h | | DirectPath | 1 | 1
JCXZ/JECXZ/JRCXZ disp8 | E3h | | | DirectPath | 2 | 1
JE/JZ disp8 | 74h | | | DirectPath | 1 | 1
JE/JZ disp16/32 | 0Fh | 84h | | DirectPath | 1 | 1
JG/JNLE disp8 | 7Fh | | | DirectPath | 1 | 1
JG/JNLE disp16/32 | 0Fh | 8Fh | | DirectPath | 1 | 1
JGE/JNL disp8 | 7Dh | | | DirectPath | 1 | 1
JGE/JNL disp16/32 | 0Fh | 8Dh | | DirectPath | 1 | 1
JL/JNGE disp8 | 7Ch | | | DirectPath | 1 | 1
JL/JNGE disp16/32 | 0Fh | 8Ch | | DirectPath | 1 | 1
JLE/JNG disp8 | 7Eh | | | DirectPath | 1 | 1
JLE/JNG disp16/32 | 0Fh | 8Eh | | DirectPath | 1 | 1
JMP disp8 (short) | EBh | | | DirectPath | 1 |
JMP disp16/32 (near, displacement) | E9h | | | DirectPath | 1 |
JMP mem16/32/64 (near, indirect) | FFh | | mm-100-xxx | DirectPath | 4 |
JMP mreg16/32/64 (near, indirect) | FFh | | 11-100-xxx | DirectPath | 1 |
JMP mem16:16/32 (far, indirect, no call gate) | FFh | | mm-101-xxx | VectorPath | 34 |



Table 13. Integer Instructions (Continued) 


Syntax | Encoding, first byte | Encoding, second byte | Encoding, ModRM byte | Decode type | Latency | Note
JMP mem16:16/32 (far, indirect, call gate) | FFh | | mm-101-xxx | VectorPath | 123 |
JMP pntr16:16/32 (far, direct, no call gate) | EAh | | | VectorPath | 31 |
JMP pntr16:16/32 (far, direct, call gate) | EAh | | | VectorPath | 120 |
JNE/JNZ disp8 | 75h | | | DirectPath | 1 | 1
JNE/JNZ disp16/32 | 0Fh | 85h | | DirectPath | 1 | 1
JNO disp8 | 71h | | | DirectPath | 1 | 1
JNO disp16/32 | 0Fh | 81h | | DirectPath | 1 | 1
JNP/JPO disp8 | 7Bh | | | DirectPath | 1 | 1
JNP/JPO disp16/32 | 0Fh | 8Bh | | DirectPath | 1 | 1
JNS disp8 | 79h | | | DirectPath | 1 | 1
JNS disp16/32 | 0Fh | 89h | | DirectPath | 1 | 1
JO disp8 | 70h | | | DirectPath | 1 | 1
JO disp16/32 | 0Fh | 80h | | DirectPath | 1 | 1
JP/JPE disp8 | 7Ah | | | DirectPath | 1 | 1
JP/JPE disp16/32 | 0Fh | 8Ah | | DirectPath | 1 | 1
JS disp8 | 78h | | | DirectPath | 1 | 1
JS disp16/32 | 0Fh | 88h | | DirectPath | 1 | 1
LAHF | 9Fh | | | VectorPath | 3 |
LAR reg16/32/64, mreg16/32/64 | 0Fh | 02h | 11-xxx-xxx | VectorPath | 22 |
LAR reg16/32/64, mem16/32/64 | 0Fh | 02h | mm-xxx-xxx | VectorPath | 24 |
LDS reg16/32, mem16:16/32 | C5h | | mm-xxx-xxx | VectorPath | - |
LEA reg16, mem16/32/64 | 8Dh | | mm-xxx-xxx | VectorPath | 3 |
LEA reg32/64, mem16/32/64 | 8Dh | | mm-xxx-xxx | DirectPath | 1/2 | 4
LEAVE (16 bit stack size) | C9h | | | VectorPath | 3 |



Table 13. Integer Instructions (Continued) 


Syntax | Encoding, first byte | Encoding, second byte | Encoding, ModRM byte | Decode type | Latency | Note
LEAVE (32 or 64 bit stack size) | C9h | | | Double | 3 |
LES reg16/32, mem32/48 | C4h | | mm-xxx-xxx | VectorPath | - |
LFS reg16/32, mem32/48 | 0Fh | B4h | | VectorPath | - |
LGDT mem16:32 | 0Fh | 01h | mm-010-xxx | VectorPath | 37 |
LGDT mem16:64 | 0Fh | 01h | mm-010-xxx | VectorPath | - |
LGS reg16/32, mem32/48 | 0Fh | B5h | | VectorPath | - |
LIDT mem16:32 | 0Fh | 01h | mm-011-xxx | VectorPath | 148 |
LIDT mem16:64 | 0Fh | 01h | mm-011-xxx | VectorPath | - |
LLDT mreg16 | 0Fh | 00h | 11-010-xxx | VectorPath | 34 |
LLDT mem16 | 0Fh | 00h | mm-010-xxx | VectorPath | 35 |
LMSW mreg16 | 0Fh | 01h | 11-100-xxx | VectorPath | 11 |
LMSW mem16 | 0Fh | 01h | mm-100-xxx | VectorPath | 12 |
LODS/LODSB mem8 | ACh | | | VectorPath | 5 | 6
LODS/LODSW mem16 | ADh | | | VectorPath | 5 | 6
LODS/LODSD mem32 | ADh | | | VectorPath | 4 | 6
LODS/LODSQ mem64 | ADh | | | VectorPath | - | 6
LOOP disp8 | E2h | | | VectorPath | 9/8 | 7
LOOPE/LOOPZ disp8 | E1h | | | VectorPath | 9/8 | 7
LOOPNE/LOOPNZ disp8 | E0h | | | VectorPath | 9/8 | 7
LSL reg16/32/64, mreg16/32 | 0Fh | 03h | 11-xxx-xxx | VectorPath | 21 |
LSL reg16/32/64, mem16/32 | 0Fh | 03h | mm-xxx-xxx | VectorPath | 23 |
LSS reg16/32/64, mem16:16/32 | 0Fh | B2h | mm-xxx-xxx | VectorPath | - |
LTR mreg16 | 0Fh | 00h | 11-011-xxx | VectorPath | - |
LTR mem16 | 0Fh | 00h | mm-011-xxx | VectorPath | - |



Table 13. Integer Instructions (Continued) 


Syntax | Encoding, first byte | Encoding, second byte | Encoding, ModRM byte | Decode type | Latency | Note
MFENCE | 0Fh | AEh | 11-110-000 | VectorPath | - |
MOV mreg8, reg8 | 88h | | 11-xxx-xxx | DirectPath | 1 |
MOV mem8, reg8 | 88h | | mm-xxx-xxx | DirectPath | 3 |
MOV mreg16/32/64, reg16/32/64 | 89h | | 11-xxx-xxx | DirectPath | 1 |
MOV mem16/32/64, reg16/32/64 | 89h | | mm-xxx-xxx | DirectPath | 3 |
MOV reg8, mreg8 | 8Ah | | 11-xxx-xxx | DirectPath | 1 |
MOV reg8, mem8 | 8Ah | | mm-xxx-xxx | DirectPath | 4 |
MOV reg16/32/64, mreg16/32/64 | 8Bh | | 11-xxx-xxx | DirectPath | 1 |
MOV reg16, mem16 | 8Bh | | mm-xxx-xxx | DirectPath | 4 |
MOV reg32/64, mem32/64 | 8Bh | | mm-xxx-xxx | DirectPath | 3 |
MOV mreg16/32/64, sreg | 8Ch | | 11-xxx-xxx | DirectPath | 4/3 | 7
MOV mem16, sreg | 8Ch | | mm-xxx-xxx | Double | 4 |
MOV sreg, mreg16/32/64 | 8Eh | | 11-xxx-xxx | VectorPath | 8 |
MOV sreg, mem16 | 8Eh | | mm-xxx-xxx | VectorPath | 10 |
MOV AL, mem8 | A0h | | | DirectPath | 4 |
MOV AX/EAX/RAX, mem16/32/64 | A1h | | | DirectPath | 4/3/3 |
MOV mem8, AL | A2h | | | DirectPath | 3 |
MOV mem16/32/64, AX/EAX/RAX | A3h | | | DirectPath | 3 |
MOV AL, imm8 | B0h | | | DirectPath | 1 |
MOV CL, imm8 | B1h | | | DirectPath | 1 |
MOV DL, imm8 | B2h | | | DirectPath | 1 |
MOV BL, imm8 | B3h | | | DirectPath | 1 |
MOV AH, imm8 | B4h | | | DirectPath | 1 |
MOV CH, imm8 | B5h | | | DirectPath | 1 |



Table 13. Integer Instructions (Continued) 


Syntax | Encoding, first byte | Encoding, second byte | Encoding, ModRM byte | Decode type | Latency | Note
MOV DH, imm8 | B6h | | | DirectPath | 1 |
MOV BH, imm8 | B7h | | | DirectPath | 1 |
MOV AX/EAX/RAX/R8, imm16/32/64 | B8h | | | DirectPath | 1 |
MOV CX/ECX/RCX/R9, imm16/32/64 | B9h | | | DirectPath | 1 |
MOV DX/EDX/RDX/R10, imm16/32/64 | BAh | | | DirectPath | 1 |
MOV BX/EBX/RBX/R11, imm16/32/64 | BBh | | | DirectPath | 1 |
MOV SP/ESP/RSP/R12, imm16/32/64 | BCh | | | DirectPath | 1 |
MOV BP/EBP/RBP/R13, imm16/32/64 | BDh | | | DirectPath | 1 |
MOV SI/ESI/RSI/R14, imm16/32/64 | BEh | | | DirectPath | 1 |
MOV DI/EDI/RDI/R15, imm16/32/64 | BFh | | | DirectPath | 1 |
MOV mreg8, imm8 | C6h | | 11-000-xxx | DirectPath | 1 |
MOV mem8, imm8 | C6h | | mm-000-xxx | DirectPath | 3 |
MOV mreg16/32/64, imm16/32 | C7h | | 11-000-xxx | DirectPath | 1 |
MOV mem16/32/64, imm16/32 | C7h | | mm-000-xxx | DirectPath | 3 |
MOVSB/MOVS mem8, mem8 | A4h | | | VectorPath | 5 | 6
MOVSW/MOVS mem16, mem16 | A5h | | | VectorPath | 5 | 6
MOVSD/MOVS mem32, mem32 | A5h | | | VectorPath | 5 | 6
MOVSQ/MOVS mem64, mem64 | A5h | | | VectorPath | - | 6
MOVSX reg16/32/64, mreg8 | 0Fh | BEh | 11-xxx-xxx | DirectPath | 1 |
MOVSX reg16/32/64, mem8 | 0Fh | BEh | mm-xxx-xxx | DirectPath | 4 |
MOVSX reg32/64, mreg16 | 0Fh | BFh | 11-xxx-xxx | DirectPath | 1 |
MOVSX reg32/64, mem16 | 0Fh | BFh | mm-xxx-xxx | DirectPath | 4 |
MOVSXD reg64, mreg32 | 63h | | | DirectPath | 1 |
MOVSXD reg64, mem32 | 63h | | | DirectPath | 4 |



Table 13. Integer Instructions (Continued) 


Syntax | Encoding, first byte | Encoding, second byte | Encoding, ModRM byte | Decode type | Latency | Note
MOVZX reg16/32/64, mreg8 | 0Fh | B6h | 11-xxx-xxx | DirectPath | 1 |
MOVZX reg16/32/64, mem8 | 0Fh | B6h | mm-xxx-xxx | DirectPath | 4 |
MOVZX reg32/64, mreg16 | 0Fh | B7h | 11-xxx-xxx | DirectPath | 1 |
MOVZX reg32/64, mem16 | 0Fh | B7h | mm-xxx-xxx | DirectPath | 4 |
MUL mreg8 | F6h | | 11-100-xxx | DirectPath | 3 |
MUL AL, mem8 | F6h | | mm-100-xxx | DirectPath | 6 |
MUL mreg16 | F7h | | 11-100-xxx | VectorPath | 4 |
MUL mem16 | F7h | | mm-100-xxx | VectorPath | 7 |
MUL mreg32 | F7h | | 11-100-xxx | Double | 3 |
MUL mem32 | F7h | | mm-100-xxx | Double | 6 |
MUL mreg64 | F7h | | 11-100-xxx | Double | 5 |
MUL mem64 | F7h | | mm-100-xxx | Double | 8 |
NEG mreg8 | F6h | | 11-011-xxx | DirectPath | 1 |
NEG mem8 | F6h | | mm-011-xxx | DirectPath | 4 |
NEG mreg16/32/64 | F7h | | 11-011-xxx | DirectPath | 1 |
NEG mem16/32/64 | F7h | | mm-011-xxx | DirectPath | 4 |
NOP (XCHG EAX, EAX) | 90h | | | DirectPath | ~0 | 5
NOT mreg8 | F6h | | 11-010-xxx | DirectPath | 1 |
NOT mem8 | F6h | | mm-010-xxx | DirectPath | 4 |
NOT mreg16/32/64 | F7h | | 11-010-xxx | DirectPath | 1 |
NOT mem16/32/64 | F7h | | mm-010-xxx | DirectPath | 4 |
OR mreg8, reg8 | 08h | | 11-xxx-xxx | DirectPath | 1 |
OR mem8, reg8 | 08h | | mm-xxx-xxx | DirectPath | 4 |
OR mreg16/32/64, reg16/32/64 | 09h | | 11-xxx-xxx | DirectPath | 1 |



Table 13. Integer Instructions (Continued) 


Syntax | Encoding, first byte | Encoding, second byte | Encoding, ModRM byte | Decode type | Latency | Note
OR mem16/32/64, reg16/32/64 | 09h | | mm-xxx-xxx | DirectPath | 4 |
OR reg8, mreg8 | 0Ah | | 11-xxx-xxx | DirectPath | 1 |
OR reg8, mem8 | 0Ah | | mm-xxx-xxx | DirectPath | 4 |
OR reg16/32/64, mreg16/32/64 | 0Bh | | 11-xxx-xxx | DirectPath | 1 |
OR reg16/32/64, mem16/32/64 | 0Bh | | mm-xxx-xxx | DirectPath | 4 |
OR AL, imm8 | 0Ch | | | DirectPath | 1 |
OR AX, imm16 | 0Dh | | | DirectPath | 1 |
OR EAX, imm32 | 0Dh | | | DirectPath | 1 |
OR RAX, imm32 (sign extended) | 0Dh | | | DirectPath | 1 |
OR mreg8, imm8 | 80h | | 11-001-xxx | DirectPath | 1 |
OR mem8, imm8 | 80h | | mm-001-xxx | DirectPath | 4 |
OR mreg16/32/64, imm16/32 | 81h | | 11-001-xxx | DirectPath | 1 |
OR mem16/32/64, imm16/32 | 81h | | mm-001-xxx | DirectPath | 4 |
OR mreg16/32/64, imm8 (sign extended) | 83h | | 11-001-xxx | DirectPath | 1 |
OR mem16/32/64, imm8 (sign extended) | 83h | | mm-001-xxx | DirectPath | 4 |
OUT imm8, AL | E6h | | | VectorPath | - |
OUT imm8, AX | E7h | | | VectorPath | - |
OUT imm8, EAX | E7h | | | VectorPath | - |
OUT DX, AL | EEh | | | VectorPath | 165 |
OUT DX, AX | EFh | | | VectorPath | 165 |
OUT DX, EAX | EFh | | | VectorPath | 165 |
POP ES | 07h | | | VectorPath | 10 |
POP SS | 17h | | | VectorPath | 31 |
POP DS | 1Fh | | | VectorPath | 10 |



Table 13. Integer Instructions (Continued) 


Syntax | Encoding, first byte | Encoding, second byte | Encoding, ModRM byte | Decode type | Latency | Note
POP FS | 0Fh | A1h | | VectorPath | 10 |
POP GS | 0Fh | A9h | | VectorPath | 10 |
POP AX/EAX/RAX/(R8) | 58h | | | Double | 3 |
POP CX/ECX/RCX/(R9) | 59h | | | Double | 3 |
POP DX/EDX/RDX/(R10) | 5Ah | | | Double | 3 |
POP BX/EBX/RBX/(R11) | 5Bh | | | Double | 3 |
POP SP/ESP/RSP/(R12) | 5Ch | | | Double | 3 |
POP BP/EBP/RBP/(R13) | 5Dh | | | Double | 3 |
POP SI/ESI/RSI/(R14) | 5Eh | | | Double | 3 |
POP DI/EDI/RDI/(R15) | 5Fh | | | Double | 3 |
POP mreg16/32/64 | 8Fh | | 11-000-xxx | VectorPath | 3 |
POP mem16/32/64 | 8Fh | | mm-000-xxx | VectorPath | 3 |
POPA/POPAD | 61h | | | VectorPath | 6 |
POPF/POPFD/POPFQ | 9Dh | | | VectorPath | 15 |
PUSH ES | 06h | | | VectorPath | 3 | 2
PUSH CS | 0Eh | | | VectorPath | 3 |
PUSH FS | 0Fh | A0h | | VectorPath | 3 |
PUSH GS | 0Fh | A8h | | VectorPath | 3 |
PUSH SS | 16h | | | VectorPath | 3 |
PUSH DS | 1Eh | | | VectorPath | 3 | 2
PUSH AX/EAX/RAX/(R8) | 50h | | | DirectPath | 3 | 2
PUSH CX/ECX/RCX/(R9) | 51h | | | DirectPath | 3 | 2
PUSH DX/EDX/RDX/(R10) | 52h | | | DirectPath | 3 | 2
PUSH BX/EBX/RBX/(R11) | 53h | | | DirectPath | 3 | 2


Table 13. Integer Instructions (Continued) 


Syntax | Encoding, first byte | Encoding, second byte | Encoding, ModRM byte | Decode type | Latency | Note
PUSH SP/ESP/RSP/(R12) | 54h | | | DirectPath | 3 | 2
PUSH BP/EBP/RBP/(R13) | 55h | | | DirectPath | 3 | 2
PUSH SI/ESI/RSI/(R14) | 56h | | | DirectPath | 3 | 2
PUSH DI/EDI/RDI/(R15) | 57h | | | DirectPath | 3 | 2
PUSH imm8 | 6Ah | | | DirectPath | 3 | 2
PUSH imm16/32 | 68h | | | DirectPath | 3 | 2
PUSH mreg16/32/64 | FFh | | 11-110-xxx | DirectPath | 3 |
PUSH mem16/32/64 | FFh | | mm-110-xxx | Double | 3 | 2
PUSHA/PUSHAD | 60h | | | VectorPath | 6 |
PUSHF/PUSHFD/PUSHFQ | 9Ch | | | VectorPath | 4 |
RCL mreg8, imm8 | C0h | | 11-010-xxx | VectorPath | 7 |
RCL mem8, imm8 | C0h | | mm-010-xxx | VectorPath | 8 |
RCL mreg16/32/64, imm8 | C1h | | 11-010-xxx | VectorPath | 7 |
RCL mem16/32/64, imm8 | C1h | | mm-010-xxx | VectorPath | 8 |
RCL mreg8, 1 | D0h | | 11-010-xxx | DirectPath | 1 |
RCL mem8, 1 | D0h | | mm-010-xxx | DirectPath | 4 |
RCL mreg16/32/64, 1 | D1h | | 11-010-xxx | DirectPath | 1 |
RCL mem16/32/64, 1 | D1h | | mm-010-xxx | DirectPath | 4 |
RCL mreg8, CL | D2h | | 11-010-xxx | VectorPath | 6 |
RCL mem8, CL | D2h | | mm-010-xxx | VectorPath | 7 |
RCL mreg16/32/64, CL | D3h | | 11-010-xxx | VectorPath | 6 |
RCL mem16/32/64, CL | D3h | | mm-010-xxx | VectorPath | 7 |
RCR mreg8, imm8 | C0h | | 11-011-xxx | VectorPath | 5 |
RCR mem8, imm8 | C0h | | mm-011-xxx | VectorPath | 6 |



Table 13. Integer Instructions (Continued) 


Syntax | Encoding, first byte | Encoding, second byte | Encoding, ModRM byte | Decode type | Latency | Note
RCR mreg16/32/64, imm8 | C1h | | 11-011-xxx | VectorPath | 5 |
RCR mem16/32/64, imm8 | C1h | | mm-011-xxx | VectorPath | 6 |
RCR mreg8, 1 | D0h | | 11-011-xxx | DirectPath | 1 |
RCR mem8, 1 | D0h | | mm-011-xxx | DirectPath | 4 |
RCR mreg16/32/64, 1 | D1h | | 11-011-xxx | DirectPath | 1 |
RCR mem16/32/64, 1 | D1h | | mm-011-xxx | DirectPath | 4 |
RCR mreg8, CL | D2h | | 11-011-xxx | VectorPath | 4 |
RCR mem8, CL | D2h | | mm-011-xxx | VectorPath | 6 |
RCR mreg16/32/64, CL | D3h | | 11-011-xxx | VectorPath | 4 |
RCR mem16/32/64, CL | D3h | | mm-011-xxx | VectorPath | 6 |
RDMSR | 0Fh | 32h | | VectorPath | 87 |
RDPMC | 0Fh | 33h | | VectorPath | - |
RDTSC | 0Fh | 31h | | VectorPath | 12 |
RET near imm16 | C2h | | | VectorPath | 5 |
RET near | C3h | | | Double | 5 |
RET far imm16 (no CPL change) | CAh | | | VectorPath | 31-44 |
RET far imm16 (CPL change) | CAh | | | VectorPath | 57-72 |
RET far (no CPL change) | CBh | | | VectorPath | 31-44 |
RET far (CPL change) | CBh | | | VectorPath | 57-72 |
ROL mreg8, imm8 | C0h | | 11-000-xxx | DirectPath | 1 | 3
ROL mem8, imm8 | C0h | | mm-000-xxx | DirectPath | 4 | 3
ROL mreg16/32/64, imm8 | C1h | | 11-000-xxx | DirectPath | 1 | 3
ROL mem16/32/64, imm8 | C1h | | mm-000-xxx | DirectPath | 4 | 3
ROL mreg8, 1 | D0h | | 11-000-xxx | DirectPath | 1 |



Notes: 

1. Static timing assumes a predicted branch. 

2. Store operation also updates ESP—the new register value is available one clock earlier than the specified 
latency. 

3. The clock count is the same regardless of the number of shifts or rotates specified by CL or imm8.

4. LEA instructions have a latency of 1 when there are two source operands (as in the case of the base + index 
form LEA EAX, [EDX+EDI]). Forms with a scale or more than two source operands will have a latency of 2 (LEA 
EAX, [EBX+EBX*8]). 

5. These instructions have an effective latency as shown. They map to internal NOPs that can be issued at a rate of 
three per cycle but do not occupy execution resources. 

6. The latency of repeated string instructions can be found in “Latency of Repeated String Instructions” on 
page 167. 

7. The first latency value is for 32-bit mode. The second is for 64-bit mode. 

8. This opcode is used as a REX prefix in 64-bit mode. 
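
Note 3 and the rotate rows of Table 13 indicate that register rotates by imm8 or CL are DirectPath with a one-clock latency whatever the count. As an illustration only, the C sketch below shows the common rotate idiom. Many compilers recognize this pattern and emit a single ROL, but that is an assumption about the toolchain rather than something this guide states. The name rotl32 is hypothetical.

```c
#include <stdint.h>

/* Rotate a 32-bit value left by n bits (illustrative helper, not from the guide).
 * The count is masked to 0..31, mirroring how the hardware treats CL for
 * 32-bit rotates, so neither shift below is ever by 32 (undefined in C). */
static uint32_t rotl32(uint32_t x, unsigned n)
{
    n &= 31u;
    return (x << n) | (x >> ((32u - n) & 31u));
}
```

If the compiler does map this to ROL, the table suggests the cost does not grow with n, so there is no benefit in splitting a large rotate into smaller ones.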
Table 13. Integer Instructions (Continued)

Syntax | First byte | Second byte | ModRM byte | Decode type | Latency | Note
ROL mem8, 1 | D0h | | mm-000-xxx | DirectPath | 4 |
ROL mreg16/32/64, 1 | D1h | | 11-000-xxx | DirectPath | 1 |
ROL mem16/32/64, 1 | D1h | | mm-000-xxx | DirectPath | 4 |
ROL mreg8, CL | D2h | | 11-000-xxx | DirectPath | 1 | 3
ROL mem8, CL | D2h | | mm-000-xxx | DirectPath | 4 | 3
ROL mreg16/32/64, CL | D3h | | 11-000-xxx | DirectPath | 1 | 3
ROL mem16/32/64, CL | D3h | | mm-000-xxx | DirectPath | 4 | 3
ROR mreg8, imm8 | C0h | | 11-001-xxx | DirectPath | 1 | 3
ROR mem8, imm8 | C0h | | mm-001-xxx | DirectPath | 4 | 3
ROR mreg16/32/64, imm8 | C1h | | 11-001-xxx | DirectPath | 1 | 3
ROR mem16/32/64, imm8 | C1h | | mm-001-xxx | DirectPath | 4 | 3
ROR mreg8, 1 | D0h | | 11-001-xxx | DirectPath | 1 |
ROR mem8, 1 | D0h | | mm-001-xxx | DirectPath | 4 |
ROR mreg16/32/64, 1 | D1h | | 11-001-xxx | DirectPath | 1 |
ROR mem16/32/64, 1 | D1h | | mm-001-xxx | DirectPath | 4 |
ROR mreg8, CL | D2h | | 11-001-xxx | DirectPath | 1 | 3
ROR mem8, CL | D2h | | mm-001-xxx | DirectPath | 4 | 3
ROR mreg16/32/64, CL | D3h | | 11-001-xxx | DirectPath | 1 | 3
ROR mem16/32/64, CL | D3h | | mm-001-xxx | DirectPath | 4 | 3
SAHF | 9Eh | | | DirectPath | 1 |
SAR mreg8, imm8 | C0h | | 11-111-xxx | DirectPath | 1 | 3
SAR mem8, imm8 | C0h | | mm-111-xxx | DirectPath | 4 | 3
SAR mreg16/32/64, imm8 | C1h | | 11-111-xxx | DirectPath | 1 | 3
SAR mem16/32/64, imm8 | C1h | | mm-111-xxx | DirectPath | 4 | 3


Table 13. Integer Instructions (Continued)

Syntax | First byte | Second byte | ModRM byte | Decode type | Latency | Note
SAR mreg8, 1 | D0h | | 11-111-xxx | DirectPath | 1 |
SAR mem8, 1 | D0h | | mm-111-xxx | DirectPath | 4 |
SAR mreg16/32/64, 1 | D1h | | 11-111-xxx | DirectPath | 1 |
SAR mem16/32/64, 1 | D1h | | mm-111-xxx | DirectPath | 4 |
SAR mreg8, CL | D2h | | 11-111-xxx | DirectPath | 1 | 3
SAR mem8, CL | D2h | | mm-111-xxx | DirectPath | 4 | 3
SAR mreg16/32/64, CL | D3h | | 11-111-xxx | DirectPath | 1 | 3
SAR mem16/32/64, CL | D3h | | mm-111-xxx | DirectPath | 4 | 3
SBB mreg8, reg8 | 18h | | 11-xxx-xxx | DirectPath | 1 |
SBB mem8, reg8 | 18h | | mm-xxx-xxx | DirectPath | 4 |
SBB mreg16/32/64, reg16/32/64 | 19h | | 11-xxx-xxx | DirectPath | 1 |
SBB mem16/32/64, reg16/32/64 | 19h | | mm-xxx-xxx | DirectPath | 4 |
SBB reg8, mreg8 | 1Ah | | 11-xxx-xxx | DirectPath | 1 |
SBB reg8, mem8 | 1Ah | | mm-xxx-xxx | DirectPath | 4 |
SBB reg16/32/64, mreg16/32/64 | 1Bh | | 11-xxx-xxx | DirectPath | 1 |
SBB reg16/32/64, mem16/32/64 | 1Bh | | mm-xxx-xxx | DirectPath | 4 |
SBB AL, imm8 | 1Ch | | | DirectPath | 1 |
SBB AX, imm16 | 1Dh | | | DirectPath | 1 |
SBB EAX, imm32 | 1Dh | | | DirectPath | 1 |
SBB RAX, imm32 (sign extended) | 1Dh | | | DirectPath | 1 |
SBB mreg8, imm8 | 80h | | 11-011-xxx | DirectPath | 1 |
SBB mem8, imm8 | 80h | | mm-011-xxx | DirectPath | 4 |
SBB mreg16/32/64, imm16/32 | 81h | | 11-011-xxx | DirectPath | 1 |
SBB mem16/32/64, imm16/32 | 81h | | mm-011-xxx | DirectPath | 4 |



Table 13. Integer Instructions (Continued)

Syntax | First byte | Second byte | ModRM byte | Decode type | Latency | Note
SBB mreg16/32/64, imm8 (sign extended) | 83h | | 11-011-xxx | DirectPath | 1 |
SBB mem16/32/64, imm8 (sign extended) | 83h | | mm-011-xxx | DirectPath | 4 |
SCASB/SCAS mem8 | AEh | | | VectorPath | 4 | 6
SCASD/SCAS mem32 | AFh | | | VectorPath | 4 | 6
SCASQ/SCAS mem64 | AFh | | | VectorPath | 4 | 6
SCASW/SCAS mem16 | AFh | | | VectorPath | 4 | 6
SETA/SETNBE mem8 | 0Fh | 97h | mm-xxx-xxx | DirectPath | 3 |
SETA/SETNBE mreg8 | 0Fh | 97h | 11-xxx-xxx | DirectPath | 1 |
SETAE/SETNB/SETNC mem8 | 0Fh | 93h | mm-xxx-xxx | DirectPath | 3 |
SETAE/SETNB/SETNC mreg8 | 0Fh | 93h | 11-xxx-xxx | DirectPath | 1 |
SETB/SETC/SETNAE mem8 | 0Fh | 92h | mm-xxx-xxx | DirectPath | 3 |
SETB/SETC/SETNAE mreg8 | 0Fh | 92h | 11-xxx-xxx | DirectPath | 1 |
SETBE/SETNA mem8 | 0Fh | 96h | mm-xxx-xxx | DirectPath | 3 |
SETBE/SETNA mreg8 | 0Fh | 96h | 11-xxx-xxx | DirectPath | 1 |
SETE/SETZ mem8 | 0Fh | 94h | mm-xxx-xxx | DirectPath | 3 |
SETE/SETZ mreg8 | 0Fh | 94h | 11-xxx-xxx | DirectPath | 1 |
SETG/SETNLE mem8 | 0Fh | 9Fh | mm-xxx-xxx | DirectPath | 3 |
SETG/SETNLE mreg8 | 0Fh | 9Fh | 11-xxx-xxx | DirectPath | 1 |
SETGE/SETNL mem8 | 0Fh | 9Dh | mm-xxx-xxx | DirectPath | 3 |
SETGE/SETNL mreg8 | 0Fh | 9Dh | 11-xxx-xxx | DirectPath | 1 |
SETL/SETNGE mem8 | 0Fh | 9Ch | mm-xxx-xxx | DirectPath | 3 |
SETL/SETNGE mreg8 | 0Fh | 9Ch | 11-xxx-xxx | DirectPath | 1 |
SETLE/SETNG mem8 | 0Fh | 9Eh | mm-xxx-xxx | DirectPath | 3 |
SETLE/SETNG mreg8 | 0Fh | 9Eh | 11-xxx-xxx | DirectPath | 1 |



Table 13. Integer Instructions (Continued)

Syntax | First byte | Second byte | ModRM byte | Decode type | Latency | Note
SETNE/SETNZ mem8 | 0Fh | 95h | mm-xxx-xxx | DirectPath | 3 |
SETNE/SETNZ mreg8 | 0Fh | 95h | 11-xxx-xxx | DirectPath | 1 |
SETNO mem8 | 0Fh | 91h | mm-xxx-xxx | DirectPath | 3 |
SETNO mreg8 | 0Fh | 91h | 11-xxx-xxx | DirectPath | 1 |
SETNP/SETPO mem8 | 0Fh | 9Bh | mm-xxx-xxx | DirectPath | 3 |
SETNP/SETPO mreg8 | 0Fh | 9Bh | 11-xxx-xxx | DirectPath | 1 |
SETNS mem8 | 0Fh | 99h | mm-xxx-xxx | DirectPath | 3 |
SETNS mreg8 | 0Fh | 99h | 11-xxx-xxx | DirectPath | 1 |
SETO mem8 | 0Fh | 90h | mm-xxx-xxx | DirectPath | 3 |
SETO mreg8 | 0Fh | 90h | 11-xxx-xxx | DirectPath | 1 |
SETP/SETPE mem8 | 0Fh | 9Ah | mm-xxx-xxx | DirectPath | 3 |
SETP/SETPE mreg8 | 0Fh | 9Ah | 11-xxx-xxx | DirectPath | 1 |
SETS mem8 | 0Fh | 98h | mm-xxx-xxx | DirectPath | 3 |
SETS mreg8 | 0Fh | 98h | 11-xxx-xxx | DirectPath | 1 |
SGDT mem48 | 0Fh | 01h | mm-000-xxx | VectorPath | 17/18 | 7
SIDT mem48 | 0Fh | 01h | mm-001-xxx | VectorPath | 17/18 | 7
SHL/SAL mreg8, imm8 | C0h | | 11-100-xxx | DirectPath | 1 | 3
SHL/SAL mem8, imm8 | C0h | | mm-100-xxx | DirectPath | 4 | 3
SHL/SAL mreg16/32/64, imm8 | C1h | | 11-100-xxx | DirectPath | 1 | 3
SHL/SAL mem16/32/64, imm8 | C1h | | mm-100-xxx | DirectPath | 4 | 3
SHL/SAL mreg8, 1 | D0h | | 11-100-xxx | DirectPath | 1 |
SHL/SAL mem8, 1 | D0h | | mm-100-xxx | DirectPath | 4 |
SHL/SAL mreg16/32/64, 1 | D1h | | 11-100-xxx | DirectPath | 1 |
SHL/SAL mem16/32/64, 1 | D1h | | mm-100-xxx | DirectPath | 4 |





Appendix C 


Instruction Latencies 



Software Optimization Guide for AMD64 Processors 


25112 Rev. 3.06 September 2005 


Table 13. Integer Instructions (Continued)

Syntax | First byte | Second byte | ModRM byte | Decode type | Latency | Note
SHL/SAL mreg8, CL | D2h | | 11-100-xxx | DirectPath | 1 |
SHL/SAL mem8, CL | D2h | | mm-100-xxx | DirectPath | 4 |
SHL/SAL mreg16/32/64, CL | D3h | | 11-100-xxx | DirectPath | 1 |
SHL/SAL mem16/32/64, CL | D3h | | mm-100-xxx | DirectPath | 4 |
SHLD mreg16/32/64, reg16/32/64, imm8 | 0Fh | A4h | 11-xxx-xxx | VectorPath | 4 |
SHLD mem16/32/64, reg16/32/64, imm8 | 0Fh | A4h | mm-xxx-xxx | VectorPath | 6 |
SHLD mreg16/32/64, reg16/32/64, CL | 0Fh | A5h | 11-xxx-xxx | VectorPath | 4 |
SHLD mem16/32/64, reg16/32/64, CL | 0Fh | A5h | mm-xxx-xxx | VectorPath | 6 | 3
SHR mreg8, imm8 | C0h | | 11-101-xxx | DirectPath | 1 | 3
SHR mem8, imm8 | C0h | | mm-101-xxx | DirectPath | 4 | 3
SHR mreg16/32/64, imm8 | C1h | | 11-101-xxx | DirectPath | 1 | 3
SHR mem16/32/64, imm8 | C1h | | mm-101-xxx | DirectPath | 4 | 3
SHR mreg8, 1 | D0h | | 11-101-xxx | DirectPath | 1 |
SHR mem8, 1 | D0h | | mm-101-xxx | DirectPath | 4 |
SHR mreg16/32/64, 1 | D1h | | 11-101-xxx | DirectPath | 1 |
SHR mem16/32/64, 1 | D1h | | mm-101-xxx | DirectPath | 4 |
SHR mreg8, CL | D2h | | 11-101-xxx | DirectPath | 1 | 3
SHR mem8, CL | D2h | | mm-101-xxx | DirectPath | 4 | 3
SHR mreg16/32/64, CL | D3h | | 11-101-xxx | DirectPath | 1 | 3
SHR mem16/32/64, CL | D3h | | mm-101-xxx | DirectPath | 4 | 3
SHRD mreg16/32/64, reg16/32/64, imm8 | 0Fh | ACh | 11-xxx-xxx | VectorPath | 4 | 3
SHRD mem16/32/64, reg16/32/64, imm8 | 0Fh | ACh | mm-xxx-xxx | VectorPath | 6 | 3
SHRD mreg16/32/64, reg16/32/64, CL | 0Fh | ADh | 11-xxx-xxx | VectorPath | 4 | 3
SHRD mem16/32/64, reg16/32/64, CL | 0Fh | ADh | mm-xxx-xxx | VectorPath | 6 | 3


Table 13. Integer Instructions (Continued)

Syntax | First byte | Second byte | ModRM byte | Decode type | Latency | Note
SLDT mreg16/32/64 | 0Fh | 00h | 11-000-xxx | VectorPath | 5 |
SLDT mem16/32/64 | 0Fh | 00h | mm-000-xxx | VectorPath | 5 |
SMSW mreg16/32/64 | 0Fh | 01h | 11-100-xxx | VectorPath | 4 |
SMSW mem16 | 0Fh | 01h | mm-100-xxx | VectorPath | 3 |
STC | F9h | | | DirectPath | 1 |
STD | FDh | | | Double | 2 |
STI | FBh | | | VectorPath | 4 |
STOSB/STOS mem8 | AAh | | | VectorPath | 4 | 6
STOSW/STOS mem16 | ABh | | | VectorPath | 4 | 6
STOSD/STOS mem32 | ABh | | | VectorPath | 4 | 6
STOSQ/STOS mem64 | ABh | | | VectorPath | 4 | 6
STR mreg16/32/64 | 0Fh | 00h | 11-001-xxx | VectorPath | 5 |
STR mem16 | 0Fh | 00h | mm-001-xxx | VectorPath | 5 |
SUB mreg8, reg8 | 28h | | 11-xxx-xxx | DirectPath | 1 |
SUB mem8, reg8 | 28h | | mm-xxx-xxx | DirectPath | 4 |
SUB mreg16/32/64, reg16/32/64 | 29h | | 11-xxx-xxx | DirectPath | 1 |
SUB mem16/32/64, reg16/32/64 | 29h | | mm-xxx-xxx | DirectPath | 4 |
SUB reg8, mreg8 | 2Ah | | 11-xxx-xxx | DirectPath | 1 |
SUB reg8, mem8 | 2Ah | | mm-xxx-xxx | DirectPath | 4 |
SUB reg16/32/64, mreg16/32/64 | 2Bh | | 11-xxx-xxx | DirectPath | 1 |
SUB reg16/32/64, mem16/32/64 | 2Bh | | mm-xxx-xxx | DirectPath | 4 |
SUB AL, imm8 | 2Ch | | | DirectPath | 1 |
SUB AX, imm16 | 2Dh | | | DirectPath | 1 |
SUB EAX, imm32 | 2Dh | | | DirectPath | 1 |



Table 13. Integer Instructions (Continued)

Syntax | First byte | Second byte | ModRM byte | Decode type | Latency | Note
SUB RAX, imm32 (sign extended) | 2Dh | | | DirectPath | 1 |
SUB mreg8, imm8 | 80h | | 11-101-xxx | DirectPath | 1 |
SUB mem8, imm8 | 80h | | mm-101-xxx | DirectPath | 4 |
SUB mreg16/32/64, imm16/32 | 81h | | 11-101-xxx | DirectPath | 1 |
SUB mem16/32/64, imm16/32 | 81h | | mm-101-xxx | DirectPath | 4 |
SUB mreg16/32/64, imm8 (sign extended) | 83h | | 11-101-xxx | DirectPath | 1 |
SUB mem16/32/64, imm8 (sign extended) | 83h | | mm-101-xxx | DirectPath | 4 |
SYSCALL | 0Fh | 05h | | VectorPath | 27 |
SYSENTER | 0Fh | 34h | | VectorPath | ~ |
SYSEXIT | 0Fh | 35h | | VectorPath | - |
SYSRET | 0Fh | 07h | | VectorPath | 35 |
TEST mreg8, reg8 | 84h | | 11-xxx-xxx | DirectPath | 1 |
TEST mem8, reg8 | 84h | | mm-xxx-xxx | DirectPath | 4 |
TEST mreg16/32/64, reg16/32/64 | 85h | | 11-xxx-xxx | DirectPath | 1 |
TEST mem16/32/64, reg16/32/64 | 85h | | mm-xxx-xxx | DirectPath | 4 |
TEST AL, imm8 | A8h | | | DirectPath | 1 |
TEST AX/EAX/RAX, imm16/32 | A9h | | | DirectPath | 1 |
TEST mreg8, imm8 | F6h | | 11-000-xxx | DirectPath | 1 |
TEST mem8, imm8 | F6h | | mm-000-xxx | DirectPath | 4 |
TEST mreg16/32/64, imm16/32 | F7h | | 11-000-xxx | DirectPath | 1 |
TEST mem16/32/64, imm16/32 | F7h | | mm-000-xxx | DirectPath | 4 |
VERR mreg16 | 0Fh | 00h | 11-100-xxx | VectorPath | 11 |
VERR mem16 | 0Fh | 00h | mm-100-xxx | VectorPath | 11 |
VERW mreg16 | 0Fh | 00h | 11-101-xxx | VectorPath | 11 |



Table 13. Integer Instructions (Continued)

Syntax | First byte | Second byte | ModRM byte | Decode type | Latency | Note
VERW mem16 | 0Fh | 00h | mm-101-xxx | VectorPath | 11 |
WAIT | 9Bh | | | DirectPath | ~0 | 5
WBINVD | 0Fh | 09h | | VectorPath | 9796/9474 | 7
WRMSR | 0Fh | 30h | | VectorPath | 134 |
XADD mreg8, reg8 | 0Fh | C0h | 11-100-xxx | VectorPath | 2 |
XADD mem8, reg8 | 0Fh | C0h | mm-100-xxx | VectorPath | 5 |
XADD mreg16/32/64, reg16/32/64 | 0Fh | C1h | 11-101-xxx | VectorPath | 2 |
XADD mem16/32/64, reg16/32/64 | 0Fh | C1h | mm-101-xxx | VectorPath | 5 |
XCHG reg8, mreg8 | 86h | | 11-xxx-xxx | VectorPath | 2 |
XCHG mreg8, reg8 | 86h | | 11-xxx-xxx | VectorPath | 2 |
XCHG reg8, mem8 | 86h | | mm-xxx-xxx | VectorPath | 16 |
XCHG mem8, reg8 | 86h | | mm-xxx-xxx | VectorPath | 16 |
XCHG reg16/32/64, mreg16/32/64 | 87h | | 11-xxx-xxx | VectorPath | 2 |
XCHG mreg16/32/64, reg16/32/64 | 87h | | 11-xxx-xxx | VectorPath | 2 |
XCHG reg16/32/64, mem16/32/64 | 87h | | mm-xxx-xxx | VectorPath | 16 |
XCHG mem16/32/64, reg16/32/64 | 87h | | mm-xxx-xxx | VectorPath | 16 |
XCHG AX/EAX/RAX, AX/EAX/RAX/(R8) (NOP) | 90h | | | DirectPath | ~0 | 5
XCHG AX/EAX/RAX, CX/ECX/RCX/(R9) | 91h | | | VectorPath | 2 |
XCHG AX/EAX/RAX, DX/EDX/RDX/(R10) | 92h | | | VectorPath | 2 |
XCHG AX/EAX/RAX, BX/EBX/RBX/(R11) | 93h | | | VectorPath | 2 |
XCHG AX/EAX/RAX, SP/ESP/RSP/(R12) | 94h | | | VectorPath | 2 |
XCHG AX/EAX/RAX, BP/EBP/RBP/(R13) | 95h | | | VectorPath | 2 |



Table 13. Integer Instructions (Continued)

Syntax | First byte | Second byte | ModRM byte | Decode type | Latency | Note
XCHG AX/EAX/RAX, SI/ESI/RSI/(R14) | 96h | | | VectorPath | 2 |
XCHG AX/EAX/RAX, DI/EDI/RDI/(R15) | 97h | | | VectorPath | 2 |
XLATB/XLAT mem8 | D7h | | | VectorPath | 5 |
XOR mreg8, reg8 | 30h | | 11-xxx-xxx | DirectPath | 1 |
XOR mem8, reg8 | 30h | | mm-xxx-xxx | DirectPath | 4 |
XOR mreg16/32/64, reg16/32/64 | 31h | | 11-xxx-xxx | DirectPath | 1 |
XOR mem16/32/64, reg16/32/64 | 31h | | mm-xxx-xxx | DirectPath | 4 |
XOR reg8, mreg8 | 32h | | 11-xxx-xxx | DirectPath | 1 |
XOR reg8, mem8 | 32h | | mm-xxx-xxx | DirectPath | 4 |
XOR reg16/32/64, mreg16/32/64 | 33h | | 11-xxx-xxx | DirectPath | 1 |
XOR reg16/32/64, mem16/32/64 | 33h | | mm-xxx-xxx | DirectPath | 4 |
XOR AL, imm8 | 34h | | | DirectPath | 1 |
XOR AX, imm16 | 35h | | | DirectPath | 1 |
XOR EAX, imm32 | 35h | | | DirectPath | 1 |
XOR RAX, imm32 (sign extended) | 35h | | | DirectPath | 1 |
XOR mreg8, imm8 | 80h | | 11-110-xxx | DirectPath | 1 |
XOR mem8, imm8 | 80h | | mm-110-xxx | DirectPath | 4 |
XOR mreg16/32/64, imm16/32 | 81h | | 11-110-xxx | DirectPath | 1 |
XOR mem16/32/64, imm16/32 | 81h | | mm-110-xxx | DirectPath | 4 |
XOR mreg16/32/64, imm8 (sign extended) | 83h | | 11-110-xxx | DirectPath | 1 |
XOR mem16/32/64, imm8 (sign extended) | 83h | | mm-110-xxx | DirectPath | 4 |



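
As a reading aid for Table 13, the C sketch below encodes a handful of its rows and looks up the latency of a given form. It makes the register-versus-memory gap easy to check, for example XCHG at 2 clocks for the register form against 16 for the memory form. The type and function names (latency_row, lookup_latency) are hypothetical and not part of this guide. The 32-bit operand spellings are one instance of the table's 16/32/64 forms, and the values are copied from the table rows.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical record for one row of Table 13 (illustrative names only). */
struct latency_row {
    const char *syntax;  /* instruction form, 32-bit instance of the table row */
    const char *decode;  /* "DirectPath", "Double", or "VectorPath" */
    int latency;         /* static latency in clocks, as listed in the table */
};

/* A few rows copied from Table 13. */
static const struct latency_row rows[] = {
    { "XCHG reg32, mreg32", "VectorPath", 2  },
    { "XCHG reg32, mem32",  "VectorPath", 16 },
    { "XOR mreg32, reg32",  "DirectPath", 1  },
    { "XOR mem32, reg32",   "DirectPath", 4  },
    { "RCR mreg32, CL",     "VectorPath", 4  },
    { "ROR mreg32, CL",     "DirectPath", 1  },
};

/* Return the latency for an exact syntax match, or -1 if the row is absent. */
static int lookup_latency(const char *syntax)
{
    for (size_t i = 0; i < sizeof rows / sizeof rows[0]; i++) {
        if (strcmp(rows[i].syntax, syntax) == 0)
            return rows[i].latency;
    }
    return -1;
}
```

A caller might compare lookup_latency("RCR mreg32, CL") with lookup_latency("ROR mreg32, CL") when deciding whether a rotate-through-carry is worth its VectorPath decode.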
C.3 MMX™ Technology Instructions

Table 14. MMX™ Technology Instructions

Syntax | Prefix byte | First byte | ModRM byte | Decode type | FPU pipe(s) | Latency | Note
EMMS | 0Fh | 77h | | DirectPath | FADD/FMUL/FSTORE | 6 | 2
MOVD mmreg, reg32 | 0Fh | 6Eh | 11-xxx-xxx | Double | - | 9 | 1
MOVD mmreg, reg64 | 0Fh | 6Eh | 11-xxx-xxx | Double | - | 9 | 1
MOVD mmreg, mem32 | 0Fh | 6Eh | mm-xxx-xxx | DirectPath | FADD/FMUL/FSTORE | 4 | 2
MOVD mmreg, mem64 | 0Fh | 6Eh | mm-xxx-xxx | DirectPath | FADD/FMUL/FSTORE | 4 | 2
MOVD reg32, mmreg | 0Fh | 7Eh | 11-xxx-xxx | Double | - | 4 | 1
MOVD reg64, mmreg | 0Fh | 7Eh | 11-xxx-xxx | Double | - | 4 | 1
MOVD mem32, mmreg | 0Fh | 7Eh | mm-xxx-xxx | DirectPath | FSTORE | 2 |
MOVD mem64, mmreg | 0Fh | 7Eh | mm-xxx-xxx | DirectPath | FSTORE | 2 |
MOVQ mmreg1, mmreg2 | 0Fh | 6Fh | 11-xxx-xxx | DirectPath | FADD/FMUL | 2 |
MOVQ mmreg, mem64 | 0Fh | 6Fh | mm-xxx-xxx | DirectPath | FADD/FMUL/FSTORE | 4 | 2
MOVQ mmreg2, mmreg1 | 0Fh | 7Fh | 11-xxx-xxx | DirectPath | FADD/FMUL | 2 |
MOVQ mem64, mmreg | 0Fh | 7Fh | mm-xxx-xxx | DirectPath | FSTORE | 2 |
PACKSSDW mmreg1, mmreg2 | 0Fh | 6Bh | 11-xxx-xxx | DirectPath | FADD/FMUL | 2 |
PACKSSDW mmreg, mem64 | 0Fh | 6Bh | mm-xxx-xxx | DirectPath | FADD/FMUL | 4 |
PACKSSWB mmreg1, mmreg2 | 0Fh | 63h | 11-xxx-xxx | DirectPath | FADD/FMUL | 2 |
PACKSSWB mmreg, mem64 | 0Fh | 63h | mm-xxx-xxx | DirectPath | FADD/FMUL | 4 |
PACKUSWB mmreg1, mmreg2 | 0Fh | 67h | 11-xxx-xxx | DirectPath | FADD/FMUL | 2 |
PACKUSWB mmreg, mem64 | 0Fh | 67h | mm-xxx-xxx | DirectPath | FADD/FMUL | 4 |
PADDB mmreg1, mmreg2 | 0Fh | FCh | 11-xxx-xxx | DirectPath | FADD/FMUL | 2 |
PADDB mmreg, mem64 | 0Fh | FCh | mm-xxx-xxx | DirectPath | FADD/FMUL | 4 |
PADDD mmreg1, mmreg2 | 0Fh | FEh | 11-xxx-xxx | DirectPath | FADD/FMUL | 2 |
PADDD mmreg, mem64 | 0Fh | FEh | mm-xxx-xxx | DirectPath | FADD/FMUL | 4 |
PADDSB mmreg1, mmreg2 | 0Fh | ECh | 11-xxx-xxx | DirectPath | FADD/FMUL | 2 |
PADDSB mmreg, mem64 | 0Fh | ECh | mm-xxx-xxx | DirectPath | FADD/FMUL | 4 |



Notes: 

1. Bits 2, 1, and 0 of the ModRM byte select the integer register. 

2. These instructions have an effective latency as shown. However, these instructions generate an internal NOP 
with a latency of two cycles but no related dependencies. These internal NOPs can be executed at a rate of 
three per cycle and can use any of the three execution resources. 
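
The ModRM byte column uses patterns such as 11-xxx-xxx (register form) and mm-xxx-xxx (memory form), and Note 1 refers to bits 2, 1, and 0 of that byte. The C sketch below separates the three fields, assuming the standard x86 ModRM layout (mod in bits 7-6, reg in bits 5-3, r/m in bits 2-0). The names modrm_fields and split_modrm are hypothetical and not part of this guide.

```c
#include <stdint.h>

/* The three ModRM fields, named after the standard x86 layout. */
struct modrm_fields {
    unsigned mod;  /* bits 7-6: 3 (binary 11) means a register operand */
    unsigned reg;  /* bits 5-3: register number or opcode extension */
    unsigned rm;   /* bits 2-0: register or memory operand selector */
};

/* Split a ModRM byte into its fields (illustrative helper). */
static struct modrm_fields split_modrm(uint8_t b)
{
    struct modrm_fields f;
    f.mod = (b >> 6) & 0x3u;
    f.reg = (b >> 3) & 0x7u;
    f.rm  = b & 0x7u;
    return f;
}
```

For example, the byte C8h splits into mod = 3, reg = 1, rm = 0, which matches the 11-xxx-xxx pattern, with the low three bits selecting register number 0.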
Table 14. MMX™ Technology Instructions (Continued)

Syntax | Prefix byte | First byte | ModRM byte | Decode type | FPU pipe(s) | Latency | Note
PADDSW mmreg1, mmreg2 | 0Fh | EDh | 11-xxx-xxx | DirectPath | FADD/FMUL | 2 |
PADDSW mmreg, mem64 | 0Fh | EDh | mm-xxx-xxx | DirectPath | FADD/FMUL | 4 |
PADDUSB mmreg1, mmreg2 | 0Fh | DCh | 11-xxx-xxx | DirectPath | FADD/FMUL | 2 |
PADDUSB mmreg, mem64 | 0Fh | DCh | mm-xxx-xxx | DirectPath | FADD/FMUL | 4 |
PADDUSW mmreg1, mmreg2 | 0Fh | DDh | 11-xxx-xxx | DirectPath | FADD/FMUL | 2 |
PADDUSW mmreg, mem64 | 0Fh | DDh | mm-xxx-xxx | DirectPath | FADD/FMUL | 4 |
PADDW mmreg1, mmreg2 | 0Fh | FDh | 11-xxx-xxx | DirectPath | FADD/FMUL | 2 |
PADDW mmreg, mem64 | 0Fh | FDh | mm-xxx-xxx | DirectPath | FADD/FMUL | 4 |
PAND mmreg1, mmreg2 | 0Fh | DBh | 11-xxx-xxx | DirectPath | FADD/FMUL | 2 |
PAND mmreg, mem64 | 0Fh | DBh | mm-xxx-xxx | DirectPath | FADD/FMUL | 4 |
PANDN mmreg1, mmreg2 | 0Fh | DFh | 11-xxx-xxx | DirectPath | FADD/FMUL | 2 |
PANDN mmreg, mem64 | 0Fh | DFh | mm-xxx-xxx | DirectPath | FADD/FMUL | 4 |
PCMPEQB mmreg1, mmreg2 | 0Fh | 74h | 11-xxx-xxx | DirectPath | FADD/FMUL | 2 |
PCMPEQB mmreg, mem64 | 0Fh | 74h | mm-xxx-xxx | DirectPath | FADD/FMUL | 4 |
PCMPEQD mmreg1, mmreg2 | 0Fh | 76h | 11-xxx-xxx | DirectPath | FADD/FMUL | 2 |
PCMPEQD mmreg, mem64 | 0Fh | 76h | mm-xxx-xxx | DirectPath | FADD/FMUL | 4 |
PCMPEQW mmreg1, mmreg2 | 0Fh | 75h | 11-xxx-xxx | DirectPath | FADD/FMUL | 2 |
PCMPEQW mmreg, mem64 | 0Fh | 75h | mm-xxx-xxx | DirectPath | FADD/FMUL | 4 |
PCMPGTB mmreg1, mmreg2 | 0Fh | 64h | 11-xxx-xxx | DirectPath | FADD/FMUL | 2 |
PCMPGTB mmreg, mem64 | 0Fh | 64h | mm-xxx-xxx | DirectPath | FADD/FMUL | 4 |
PCMPGTD mmreg1, mmreg2 | 0Fh | 66h | 11-xxx-xxx | DirectPath | FADD/FMUL | 2 |
PCMPGTD mmreg, mem64 | 0Fh | 66h | mm-xxx-xxx | DirectPath | FADD/FMUL | 4 |
PCMPGTW mmreg1, mmreg2 | 0Fh | 65h | 11-xxx-xxx | DirectPath | FADD/FMUL | 2 |
PCMPGTW mmreg, mem64 | 0Fh | 65h | mm-xxx-xxx | DirectPath | FADD/FMUL | 4 |
PMADDWD mmreg1, mmreg2 | 0Fh | F5h | 11-xxx-xxx | DirectPath | FMUL | 3 |
PMADDWD mmreg, mem64 | 0Fh | F5h | mm-xxx-xxx | DirectPath | FMUL | 5 |
PMULHW mmreg1, mmreg2 | 0Fh | E5h | 11-xxx-xxx | DirectPath | FMUL | 3 |
PMULHW mmreg, mem64 | 0Fh | E5h | mm-xxx-xxx | DirectPath | FMUL | 5 |
PMULLW mmreg1, mmreg2 | 0Fh | D5h | 11-xxx-xxx | DirectPath | FMUL | 3 |
PMULLW mmreg, mem64 | 0Fh | D5h | mm-xxx-xxx | DirectPath | FMUL | 5 |



Table 14. MMX™ Technology Instructions (Continued)

Syntax | Prefix byte | First byte | ModRM byte | Decode type | FPU pipe(s) | Latency | Note
POR mmreg1, mmreg2 | 0Fh | EBh | 11-xxx-xxx | DirectPath | FADD/FMUL | 2 |
POR mmreg, mem64 | 0Fh | EBh | mm-xxx-xxx | DirectPath | FADD/FMUL | 4 |
PSLLD mmreg1, mmreg2 | 0Fh | F2h | 11-xxx-xxx | DirectPath | FADD/FMUL | 2 |
PSLLD mmreg, mem64 | 0Fh | F2h | mm-xxx-xxx | DirectPath | FADD/FMUL | 4 |
PSLLD mmreg, imm8 | 0Fh | 72h | 11-110-xxx | DirectPath | FADD/FMUL | 2 |
PSLLQ mmreg1, mmreg2 | 0Fh | F3h | 11-xxx-xxx | DirectPath | FADD/FMUL | 2 |
PSLLQ mmreg, mem64 | 0Fh | F3h | mm-xxx-xxx | DirectPath | FADD/FMUL | 4 |
PSLLQ mmreg, imm8 | 0Fh | 73h | 11-110-xxx | DirectPath | FADD/FMUL | 2 |
PSLLW mmreg1, mmreg2 | 0Fh | F1h | 11-xxx-xxx | DirectPath | FADD/FMUL | 2 |
PSLLW mmreg, mem64 | 0Fh | F1h | mm-xxx-xxx | DirectPath | FADD/FMUL | 4 |
PSLLW mmreg, imm8 | 0Fh | 71h | 11-110-xxx | DirectPath | FADD/FMUL | 2 |
PSRAD mmreg1, mmreg2 | 0Fh | E2h | 11-xxx-xxx | DirectPath | FADD/FMUL | 2 |
PSRAD mmreg, mem64 | 0Fh | E2h | mm-xxx-xxx | DirectPath | FADD/FMUL | 4 |
PSRAD mmreg, imm8 | 0Fh | 72h | 11-100-xxx | DirectPath | FADD/FMUL | 2 |
PSRAW mmreg1, mmreg2 | 0Fh | E1h | 11-xxx-xxx | DirectPath | FADD/FMUL | 2 |
PSRAW mmreg, mem64 | 0Fh | E1h | mm-xxx-xxx | DirectPath | FADD/FMUL | 4 |
PSRAW mmreg, imm8 | 0Fh | 71h | 11-100-xxx | DirectPath | FADD/FMUL | 2 |
PSRLD mmreg1, mmreg2 | 0Fh | D2h | 11-xxx-xxx | DirectPath | FADD/FMUL | 2 |
PSRLD mmreg, mem64 | 0Fh | D2h | mm-xxx-xxx | DirectPath | FADD/FMUL | 4 |
PSRLD mmreg, imm8 | 0Fh | 72h | 11-010-xxx | DirectPath | FADD/FMUL | 2 |
PSRLQ mmreg1, mmreg2 | 0Fh | D3h | 11-xxx-xxx | DirectPath | FADD/FMUL | 2 |
PSRLQ mmreg, mem64 | 0Fh | D3h | mm-xxx-xxx | DirectPath | FADD/FMUL | 4 |
PSRLQ mmreg, imm8 | 0Fh | 73h | 11-010-xxx | DirectPath | FADD/FMUL | 2 |
PSRLW mmreg1, mmreg2 | 0Fh | D1h | 11-xxx-xxx | DirectPath | FADD/FMUL | 2 |
PSRLW mmreg, mem64 | 0Fh | D1h | mm-xxx-xxx | DirectPath | FADD/FMUL | 4 |
PSRLW mmreg, imm8 | 0Fh | 71h | 11-010-xxx | DirectPath | FADD/FMUL | 2 |
PSUBB mmreg1, mmreg2 | 0Fh | F8h | 11-xxx-xxx | DirectPath | FADD/FMUL | 2 |
PSUBB mmreg, mem64 | 0Fh | F8h | mm-xxx-xxx | DirectPath | FADD/FMUL | 4 |
PSUBD mmreg1, mmreg2 | 0Fh | FAh | 11-xxx-xxx | DirectPath | FADD/FMUL | 2 |
PSUBD mmreg, mem64 | 0Fh | FAh | mm-xxx-xxx | DirectPath | FADD/FMUL | 4 |



Notes: 

1. Bits 2, 1, and 0 of the ModRM byte select the integer register. 

2. These instructions have an effective latency as shown. However, these instructions generate an internal NOP 
with a latency of two cycles but no related dependencies. These internal NOPs can be executed at a rate of 
three per cycle and can use any of the three execution resources. 



Software Optimization Guide for AMD64 Processors 


25112 Rev. 3.06 September 2005 


Table 14. MMX™ Technology Instructions (Continued)

                          Encoding
Syntax                    Prefix byte  First byte  ModRM byte  Decode type  FPU pipe(s)  Latency  Note
PSUBSB mmreg1, mmreg2     0Fh          E8h         11-xxx-xxx  DirectPath   FADD/FMUL    2
PSUBSB mmreg, mem64       0Fh          E8h         mm-xxx-xxx  DirectPath   FADD/FMUL    4
PSUBSW mmreg1, mmreg2     0Fh          E9h         11-xxx-xxx  DirectPath   FADD/FMUL    2
PSUBSW mmreg, mem64       0Fh          E9h         mm-xxx-xxx  DirectPath   FADD/FMUL    4
PSUBUSB mmreg1, mmreg2    0Fh          D8h         11-xxx-xxx  DirectPath   FADD/FMUL    2
PSUBUSB mmreg, mem64      0Fh          D8h         mm-xxx-xxx  DirectPath   FADD/FMUL    4
PSUBUSW mmreg1, mmreg2    0Fh          D9h         11-xxx-xxx  DirectPath   FADD/FMUL    2
PSUBUSW mmreg, mem64      0Fh          D9h         mm-xxx-xxx  DirectPath   FADD/FMUL    4
PSUBW mmreg1, mmreg2      0Fh          F9h         11-xxx-xxx  DirectPath   FADD/FMUL    2
PSUBW mmreg, mem64        0Fh          F9h         mm-xxx-xxx  DirectPath   FADD/FMUL    4
PUNPCKHBW mmreg1, mmreg2  0Fh          68h         11-xxx-xxx  DirectPath   FADD/FMUL    2
PUNPCKHBW mmreg, mem64    0Fh          68h         mm-xxx-xxx  DirectPath   FADD/FMUL    4
PUNPCKHDQ mmreg1, mmreg2  0Fh          6Ah         11-xxx-xxx  DirectPath   FADD/FMUL    2
PUNPCKHDQ mmreg, mem64    0Fh          6Ah         mm-xxx-xxx  DirectPath   FADD/FMUL    4
PUNPCKHWD mmreg1, mmreg2  0Fh          69h         11-xxx-xxx  DirectPath   FADD/FMUL    2
PUNPCKHWD mmreg, mem64    0Fh          69h         mm-xxx-xxx  DirectPath   FADD/FMUL    4
PUNPCKLBW mmreg1, mmreg2  0Fh          60h         11-xxx-xxx  DirectPath   FADD/FMUL    2
PUNPCKLBW mmreg, mem64    0Fh          60h         mm-xxx-xxx  DirectPath   FADD/FMUL    4
PUNPCKLDQ mmreg1, mmreg2  0Fh          62h         11-xxx-xxx  DirectPath   FADD/FMUL    2
PUNPCKLDQ mmreg, mem64    0Fh          62h         mm-xxx-xxx  DirectPath   FADD/FMUL    4
PUNPCKLWD mmreg1, mmreg2  0Fh          61h         11-xxx-xxx  DirectPath   FADD/FMUL    2
PUNPCKLWD mmreg, mem64    0Fh          61h         mm-xxx-xxx  DirectPath   FADD/FMUL    4
PXOR mmreg1, mmreg2       0Fh          EFh         11-xxx-xxx  DirectPath   FADD/FMUL    2
PXOR mmreg, mem64         0Fh          EFh         mm-xxx-xxx  DirectPath   FADD/FMUL    4



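
The ModRM column of Table 14 follows the pattern mod-reg-r/m: `11` in the top two bits marks the register form, `mm` stands for any memory-addressing mode, and a fixed middle field such as `110` is an opcode extension. As a minimal sketch of how such a pattern maps onto a byte, the C helpers below pack and inspect the three fields. The function names are hypothetical and are not part of this guide.

```c
#include <stdint.h>

/* Hypothetical helper: pack the three ModRM fields into one byte.
   mod occupies bits 7-6, reg bits 5-3, and r/m bits 2-0. */
static uint8_t modrm_byte(unsigned mod, unsigned reg, unsigned rm)
{
    return (uint8_t)(((mod & 3u) << 6) | ((reg & 7u) << 3) | (rm & 7u));
}

/* Nonzero when mod = 11b, that is, the register-operand form
   written as "11-xxx-xxx" in the tables. */
static int modrm_is_register_form(uint8_t modrm)
{
    return (modrm >> 6) == 3u;
}
```

For example, the row PSLLW mmreg, imm8 lists `11-110-xxx`; with MMX register 3 as the operand, `modrm_byte(3, 6, 3)` gives F3h.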
C.4 x87 Floating-Point Instructions


Table 15. x87 Floating-Point Instructions

                       Encoding
Syntax                 First byte  Second byte  ModRM byte  Decode type  FPU pipe(s)       Latency   Note
F2XM1                  D9h                      11-110-000  VectorPath   -                 65
FABS                   D9h                      11-100-001  DirectPath   FMUL              2
FADD ST, ST(i)         D8h                      11-000-xxx  DirectPath   FADD              4         1
FADD [mem32real]       D8h                      mm-000-xxx  DirectPath   FADD              6
FADD ST(i), ST         DCh                      11-000-xxx  DirectPath   FADD              4         1
FADD [mem64real]       DCh                      mm-000-xxx  DirectPath   FADD              6
FADDP ST(i), ST        DEh                      11-000-xxx  DirectPath   FADD              4         1
FBLD [mem80]           DFh                      mm-100-xxx  VectorPath   -                 87
FBSTP [mem80]          DFh                      mm-110-xxx  VectorPath   -                 172
FCHS                   D9h                      11-100-000  DirectPath   FMUL              2
FCLEX                  DBh         E2h          11-100-010  VectorPath   -                 ~
FCMOVB ST(0), ST(i)    DAh                      11-000-xxx  VectorPath   -                 15        5
FCMOVBE ST(0), ST(i)   DAh                      11-010-xxx  VectorPath   -                 15        5
FCMOVE ST(0), ST(i)    DAh                      11-001-xxx  VectorPath   -                 15        5
FCMOVNB ST(0), ST(i)   DBh                      11-000-xxx  VectorPath   -                 15        5
FCMOVNBE ST(0), ST(i)  DBh                      11-010-xxx  VectorPath   -                 15        5
FCMOVNE ST(0), ST(i)   DBh                      11-001-xxx  VectorPath   -                 15        5
FCMOVNU ST(0), ST(i)   DBh                      11-011-xxx  VectorPath   -                 15        5
FCMOVU ST(0), ST(i)    DAh                      11-011-xxx  VectorPath   -                 15        5
FCOM ST(i)             D8h                      11-010-xxx  DirectPath   FADD              2         1
FCOM [mem32real]       D8h                      mm-010-xxx  DirectPath   FADD              4
FCOM [mem64real]       DCh                      mm-010-xxx  DirectPath   FADD              4
FCOMI ST, ST(i)        DBh                      11-110-xxx  VectorPath   FADD              3         3


Notes: 

1. The last three bits of the ModRM byte select the stack entry ST(i). 

2. These instructions have an effective latency as shown. However, these instructions generate an internal NOP 
with a latency of two cycles but no related dependencies. These internal NOPs can be executed at a rate of 
three per cycle and can use any of the three execution resources. 

3. This is a VectorPath decoded operation that uses one execution pipe (one ROP). 

4. There is additional latency associated with this instruction. “e” represents the difference between the exponents 
of the divisor and the dividend. If “s” is the number of normalization shifts performed on the result, then 
n = (s+1)/2 where (0 <= n <= 32). 

5. The latency provided for this operation is the best-case latency. 

6. The three latency numbers represent the latency values for precision control settings of single precision, double 
precision, and extended precision, respectively. 
Table 15. x87 Floating-Point Instructions (Continued)

                       Encoding
Syntax                 First byte  Second byte  ModRM byte  Decode type  FPU pipe(s)       Latency   Note
FCOMIP ST, ST(i)       DFh                      11-110-xxx  VectorPath   FADD              3         3
FCOMP ST(i)            D8h                      11-011-xxx  DirectPath   FADD              2         1
FCOMP [mem32real]      D8h                      mm-011-xxx  DirectPath   FADD              4
FCOMP [mem64real]      DCh                      mm-011-xxx  DirectPath   FADD              4
FCOMPP                 DEh                      11-011-001  DirectPath   FADD              2
FCOS                   D9h                      11-111-111  VectorPath   -                 92
FDECSTP                D9h                      11-110-110  DirectPath   FADD/FMUL/FSTORE  2
FDIV ST, ST(i)         D8h                      11-110-xxx  DirectPath   FMUL              16/20/24  1, 6
FDIV ST(i), ST         DCh                      11-111-xxx  DirectPath   FMUL              16/20/24  1, 6
FDIV [mem32real]       D8h                      mm-110-xxx  DirectPath   FMUL              18/22/26  6
FDIV [mem64real]       DCh                      mm-110-xxx  DirectPath   FMUL              18/22/26  6
FDIVP ST(i), ST        DEh                      11-111-xxx  DirectPath   FMUL              16/20/24  1, 6
FDIVR ST, ST(i)        D8h                      11-110-xxx  DirectPath   FMUL              16/20/24  1, 6
FDIVR ST(i), ST        DCh                      11-111-xxx  DirectPath   FMUL              16/20/24  1, 6
FDIVR [mem32real]      D8h                      mm-111-xxx  DirectPath   FMUL              18/22/26  6
FDIVR [mem64real]      DCh                      mm-111-xxx  DirectPath   FMUL              18/22/26  6
FDIVRP                 DEh                      11-110-001  DirectPath   FMUL              16/20/24  6


Table 15. x87 Floating-Point Instructions (Continued)

                       Encoding
Syntax                 First byte  Second byte  ModRM byte  Decode type  FPU pipe(s)       Latency   Note
FDIVRP ST(i), ST       DEh                      11-110-xxx  DirectPath   FMUL              16/20/24  1, 6
FFREE ST(i)            DDh                      11-000-xxx  DirectPath   FADD/FMUL/FSTORE  2         1, 2
FIADD [mem32int]       DAh                      mm-000-xxx  Double       -                 11
FIADD [mem16int]       DEh                      mm-000-xxx  Double       -                 11
FICOM [mem32int]       DAh                      mm-010-xxx  Double       -                 9
FICOM [mem16int]       DEh                      mm-010-xxx  Double       -                 9
FICOMP [mem32int]      DAh                      mm-011-xxx  Double       -                 9
FICOMP [mem16int]      DEh                      mm-011-xxx  Double       -                 9
FIDIV [mem32int]       DAh                      mm-110-xxx  Double       -                 18
FIDIV [mem16int]       DEh                      mm-110-xxx  Double       -                 18
FIDIVR [mem32int]      DAh                      mm-111-xxx  Double       -                 18
FIDIVR [mem16int]      DEh                      mm-111-xxx  Double       -                 18
FILD [mem16int]        DFh                      mm-000-xxx  DirectPath   FSTORE            6
FILD [mem32int]        DBh                      mm-000-xxx  DirectPath   FSTORE            6
FILD [mem64int]        DFh                      mm-101-xxx  DirectPath   FSTORE            6
FIMUL [mem32int]       DAh                      mm-001-xxx  Double       -                 11
FIMUL [mem16int]       DEh                      mm-001-xxx  Double       -                 11
FINCSTP                D9h                      11-110-111  DirectPath   FADD/FMUL/FSTORE  2         2
FINIT                  DBh                      11-100-011  VectorPath   -                 ~
FIST [mem16int]        DFh                      mm-010-xxx  DirectPath   FSTORE            4
FIST [mem32int]        DBh                      mm-010-xxx  DirectPath   FSTORE            4
FISTP [mem16int]       DFh                      mm-011-xxx  DirectPath   FSTORE            4
FISTP [mem32int]       DBh                      mm-011-xxx  DirectPath   FSTORE            4



Table 15. x87 Floating-Point Instructions (Continued)

                       Encoding
Syntax                 First byte  Second byte  ModRM byte  Decode type  FPU pipe(s)       Latency   Note
FISTP [mem64int]                                mm-111-xxx  DirectPath   FSTORE            4
FISTTP [mem16int]                               mm-010-xxx  DirectPath   FSTORE            4
FISTTP [mem32int]                               mm-010-xxx  DirectPath   FSTORE            4
FISTTP [mem64int]                               mm-010-xxx  DirectPath   FSTORE            4
FISUB [mem32int]       DAh                      mm-100-xxx  Double       -                 11
FISUB [mem16int]                                mm-100-xxx  Double       -                 11
FISUBR [mem32int]      DAh                      mm-101-xxx  Double       -                 11
FISUBR [mem16int]      DEh                      mm-101-xxx  Double       -                 11
FLD ST(i)              D9h                      11-000-xxx  DirectPath   FADD/FMUL         2         1
FLD [mem32real]        D9h                      mm-000-xxx  DirectPath   FADD/FMUL/FSTORE  4
FLD [mem64real]        DDh                      mm-000-xxx  DirectPath   FADD/FMUL/FSTORE  4
FLD [mem80real]        DBh                      mm-101-xxx  VectorPath   -                 13
FLD1                   D9h                      11-101-000  DirectPath   FSTORE            4
FLDCW [mem16]          D9h                      mm-101-xxx  VectorPath   -                 11
FLDENV [mem14byte]     D9h                      mm-100-xxx  VectorPath   -                 129
FLDENV [mem28byte]     D9h                      mm-100-xxx  VectorPath   -                 129
FLDL2E                 D9h                      11-101-010  DirectPath   FSTORE            4
FLDL2T                 D9h                      11-101-001  DirectPath   FSTORE            4
FLDLG2                 D9h                      11-101-100  DirectPath   FSTORE            4
FLDLN2                 D9h                      11-101-101  DirectPath   FSTORE            4
FLDPI                  D9h                      11-101-011  DirectPath   FSTORE            4
FLDZ                   D9h                      11-101-110  DirectPath   FSTORE            4
FMUL ST, ST(i)         D8h                      11-001-xxx  DirectPath   FMUL              4         1
FMUL ST(i), ST         DCh                      11-001-xxx  DirectPath   FMUL              4         1


Table 15. x87 Floating-Point Instructions (Continued)

                       Encoding
Syntax                 First byte  Second byte  ModRM byte  Decode type  FPU pipe(s)       Latency   Note
FMUL [mem32real]       D8h                      mm-001-xxx  DirectPath   FMUL              6
FMUL [mem64real]       DCh                      mm-001-xxx  DirectPath   FMUL              6
FMULP ST(i), ST        DEh                      11-001-xxx  DirectPath   FMUL              4         1
FNCLEX                 DBh         E2h                      VectorPath                     16
FNINIT                 DBh         E3h                      VectorPath                     89
FNOP                   D9h                      11-010-000  DirectPath   FADD/FMUL/FSTORE  2         2
FPATAN                 D9h                      11-110-011  VectorPath   -                 136
FPREM                  D9h                      11-111-000  DirectPath   FMUL              9+e+n     4
FPREM1                 D9h                      11-110-101  DirectPath   FMUL              9+e+n     4
FPTAN                  D9h                      11-110-010  VectorPath   -                 107
FRNDINT                D9h                      11-111-100  VectorPath   -                 10
FRSTOR [mem94byte]     DDh                      mm-100-xxx  VectorPath   -                 138
FRSTOR [mem108byte]    DDh                      mm-100-xxx  VectorPath   -                 138
FSAVE [mem94byte]      DDh                      mm-110-xxx  VectorPath   -                 159
FSAVE [mem108byte]     DDh                      mm-110-xxx  VectorPath   -                 159
FSCALE                 D9h                      11-111-101  VectorPath   -                 9
FSIN                   D9h                      11-111-110  VectorPath   -                 93
FSINCOS                D9h                      11-111-011  VectorPath   -                 104
FSQRT                  D9h                      11-111-010  DirectPath   FMUL              35
FST [mem32real]        D9h                      mm-010-xxx  DirectPath   FSTORE            2
FST [mem64real]        DDh                      mm-010-xxx  DirectPath   FSTORE            2
FST ST(i)              DDh                      11-010-xxx  DirectPath   FADD/FMUL         2
FSTCW [mem16]          D9h                      mm-111-xxx  VectorPath   -                 4
FSTENV [mem14byte]     D9h                      mm-110-xxx  VectorPath   -                 89



Table 15. x87 Floating-Point Instructions (Continued)

                       Encoding
Syntax                 First byte  Second byte  ModRM byte  Decode type  FPU pipe(s)       Latency   Note
FSTENV [mem28byte]     D9h                      mm-110-xxx  VectorPath   -                 89
FSTP [mem32real]       D9h                      mm-011-xxx  DirectPath   FADD/FMUL         2
FSTP [mem64real]       DDh                      mm-011-xxx  DirectPath   FADD/FMUL         2
FSTP [mem80real]       D9h                      mm-111-xxx  VectorPath   -                 8
FSTP ST(i)             DDh                      11-011-xxx  DirectPath   FADD/FMUL         2
FSTSW AX               DFh                      11-100-000  VectorPath   -                 12
FSTSW [mem16]          DDh                      mm-111-xxx  VectorPath   FSTORE            8         3
FSUB [mem32real]       D8h                      mm-100-xxx  DirectPath   FADD              6
FSUB [mem64real]       DCh                      mm-100-xxx  DirectPath   FADD              6
FSUB ST, ST(i)         D8h                      11-100-xxx  DirectPath   FADD              4         1
FSUB ST(i), ST         DCh                      11-101-xxx  DirectPath   FADD              4         1
FSUBP ST(i), ST        DEh                      11-101-xxx  DirectPath   FADD              4         1
FSUBR [mem32real]      D8h                      mm-101-xxx  DirectPath   FADD              6
FSUBR [mem64real]      DCh                      mm-101-xxx  DirectPath   FADD              6
FSUBR ST, ST(i)        D8h                      11-100-xxx  DirectPath   FADD              4         1
FSUBR ST(i), ST        DCh                      11-101-xxx  DirectPath   FADD              4         1
FSUBRP ST(i), ST       DEh                      11-100-xxx  DirectPath   FADD              4         1
FTST                   D9h                      11-100-100  DirectPath   FADD              2
FUCOM                  DDh                      11-100-xxx  DirectPath   FADD              2
FUCOMI ST, ST(i)                                11-101-xxx  VectorPath   FADD              3         3
FUCOMIP ST, ST(i)                               11-101-xxx  VectorPath   FADD              3         3
FUCOMP                                          11-101-xxx  DirectPath   FADD              2
FUCOMPP                DAh                      11-101-001  DirectPath   FADD              2
FWAIT                                                       DirectPath   -                 0
FXAM                                            11-100-101  VectorPath   -                 2



Table 15. x87 Floating-Point Instructions (Continued)

                       Encoding
Syntax                 First byte  Second byte  ModRM byte  Decode type  FPU pipe(s)       Latency   Note
FXCH                   D9h                      11-001-xxx  DirectPath   FADD/FMUL/FSTORE  2         2
FXRSTOR [mem512byte]   0Fh         AEh          mm-001-xxx  VectorPath   -                 68 (108)
FXSAVE [mem512byte]    0Fh         AEh          mm-000-xxx  VectorPath   -                 31 (79)
FXTRACT                D9h                      11-110-100  VectorPath   -                 9
FYL2X                  D9h                      11-110-001  VectorPath   -                 ~
FYL2XP1                D9h                      11-111-001  VectorPath   -                 113



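
Two latency entries in Table 15 are not single numbers: the FDIV family lists three values selected by the precision-control setting (Note 6), and FPREM/FPREM1 list the expression 9+e+n (Note 4). The following C sketch shows one way to evaluate both. The type and function names are hypothetical, and the use of integer division for n = (s+1)/2 is an assumption, since the notes do not state how the quotient is rounded.

```c
/* Precision-control settings, in the order the table lists latencies. */
enum x87_precision { X87_SINGLE = 0, X87_DOUBLE = 1, X87_EXTENDED = 2 };

/* FDIV latency from Table 15: 16/20/24 cycles for the register forms,
   18/22/26 cycles for the memory forms (two cycles more). */
static int fdiv_latency(enum x87_precision pc, int from_memory)
{
    static const int reg_cycles[3] = { 16, 20, 24 };
    return reg_cycles[pc] + (from_memory ? 2 : 0);
}

/* FPREM/FPREM1 latency "9+e+n" with n = (s+1)/2 kept within 0..32.
   e: exponent difference between divisor and dividend (Note 4).
   s: number of normalization shifts performed on the result.
   Assumption: integer division; the guide does not specify rounding. */
static int fprem_latency(int e, int s)
{
    int n = (s + 1) / 2;
    if (n < 0)  n = 0;
    if (n > 32) n = 32;
    return 9 + e + n;
}
```

Under these assumptions, `fdiv_latency(X87_DOUBLE, 0)` returns 20 and `fprem_latency(3, 5)` returns 15.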
C.5 3DNow!™ Technology Instructions


Table 16. 3DNow!™ Technology Instructions

                         Encoding
Syntax                   Prefix byte(s)  imm8  ModRM byte  Decode type  FPU pipe(s)       Latency  Note
FEMMS                    0Fh             0Eh               DirectPath   FADD/FMUL/FSTORE  2        2
PAVGUSB mmreg1, mmreg2   0Fh, 0Fh        BFh   11-xxx-xxx  DirectPath   FADD/FMUL         2
PAVGUSB mmreg, mem64     0Fh, 0Fh        BFh   mm-xxx-xxx  DirectPath   FADD/FMUL         4
PF2ID mmreg1, mmreg2     0Fh, 0Fh        1Dh   11-xxx-xxx  DirectPath   FADD              4
PF2ID mmreg, mem64       0Fh, 0Fh        1Dh   mm-xxx-xxx  DirectPath   FADD              6
PFACC mmreg1, mmreg2     0Fh, 0Fh        AEh   11-xxx-xxx  DirectPath   FADD              4
PFACC mmreg, mem64       0Fh, 0Fh        AEh   mm-xxx-xxx  DirectPath   FADD              6
PFADD mmreg1, mmreg2     0Fh, 0Fh        9Eh   11-xxx-xxx  DirectPath   FADD              4
PFADD mmreg, mem64       0Fh, 0Fh        9Eh   mm-xxx-xxx  DirectPath   FADD              6
PFCMPEQ mmreg1, mmreg2   0Fh, 0Fh        B0h   11-xxx-xxx  DirectPath   FADD              2
PFCMPEQ mmreg, mem64     0Fh, 0Fh        B0h   mm-xxx-xxx  DirectPath   FADD              4
PFCMPGE mmreg1, mmreg2   0Fh, 0Fh        90h   11-xxx-xxx  DirectPath   FADD              2
PFCMPGE mmreg, mem64     0Fh, 0Fh        90h   mm-xxx-xxx  DirectPath   FADD              4
PFCMPGT mmreg1, mmreg2   0Fh, 0Fh        A0h   11-xxx-xxx  DirectPath   FADD              2
PFCMPGT mmreg, mem64     0Fh, 0Fh        A0h   mm-xxx-xxx  DirectPath   FADD              4
PFMAX mmreg1, mmreg2     0Fh, 0Fh        A4h   11-xxx-xxx  DirectPath   FADD              2
PFMAX mmreg, mem64       0Fh, 0Fh        A4h   mm-xxx-xxx  DirectPath   FADD              4
PFMIN mmreg1, mmreg2     0Fh, 0Fh        94h   11-xxx-xxx  DirectPath   FADD              2
PFMIN mmreg, mem64       0Fh, 0Fh        94h   mm-xxx-xxx  DirectPath   FADD              4
PFMUL mmreg1, mmreg2     0Fh, 0Fh        B4h   11-xxx-xxx  DirectPath   FMUL              4
PFMUL mmreg, mem64       0Fh, 0Fh        B4h   mm-xxx-xxx  DirectPath   FMUL              6
PFRCP mmreg1, mmreg2     0Fh, 0Fh        96h   11-xxx-xxx  DirectPath   FMUL              3
PFRCP mmreg, mem64       0Fh, 0Fh        96h   mm-xxx-xxx  DirectPath   FMUL              5
PFRCPIT1 mmreg1, mmreg2  0Fh, 0Fh        A6h   11-xxx-xxx  DirectPath   FMUL              4
PFRCPIT1 mmreg, mem64    0Fh, 0Fh        A6h   mm-xxx-xxx  DirectPath   FMUL              6
PFRCPIT2 mmreg1, mmreg2  0Fh, 0Fh        B6h   11-xxx-xxx  DirectPath   FMUL              4
PFRCPIT2 mmreg, mem64    0Fh, 0Fh        B6h   mm-xxx-xxx  DirectPath   FMUL              6
PFRSQIT1 mmreg1, mmreg2  0Fh, 0Fh        A7h   11-xxx-xxx  DirectPath   FMUL              4



Notes: 

1. For the PREFETCH and PREFETCHW instructions, the mem8 value refers to an address in the 64-byte line to 
be prefetched. 

2. The byte listed in the column titled ‘imm8’ is actually the opcode byte. 
Table 16. 3DNow!™ Technology Instructions (Continued)

                         Encoding
Syntax                   Prefix byte(s)  imm8  ModRM byte  Decode type  FPU pipe(s)       Latency  Note
PFRSQIT1 mmreg, mem64    0Fh, 0Fh        A7h   mm-xxx-xxx  DirectPath   FMUL              6
PFRSQRT mmreg1, mmreg2   0Fh, 0Fh        97h   11-xxx-xxx  DirectPath   FMUL              3
PFRSQRT mmreg, mem64     0Fh, 0Fh        97h   mm-xxx-xxx  DirectPath   FMUL              5
PFSUB mmreg1, mmreg2     0Fh, 0Fh        9Ah   11-xxx-xxx  DirectPath   FADD              4
PFSUB mmreg, mem64       0Fh, 0Fh        9Ah   mm-xxx-xxx  DirectPath   FADD              6
PFSUBR mmreg1, mmreg2    0Fh, 0Fh        AAh   11-xxx-xxx  DirectPath   FADD              4
PFSUBR mmreg, mem64      0Fh, 0Fh        AAh   mm-xxx-xxx  DirectPath   FADD              6
PI2FD mmreg1, mmreg2     0Fh, 0Fh        0Dh   11-xxx-xxx  DirectPath   FADD              4
PI2FD mmreg, mem64       0Fh, 0Fh        0Dh   mm-xxx-xxx  DirectPath   FADD              6
PMULHRW mmreg1, mmreg2   0Fh, 0Fh        B7h   11-xxx-xxx  DirectPath   FMUL              3
PMULHRW mmreg1, mem64    0Fh, 0Fh        B7h   mm-xxx-xxx  DirectPath   FMUL              5
PREFETCH mem8            0Fh             0Dh   mm-000-xxx  DirectPath   -                 ~        1, 2
PREFETCHW mem8           0Fh             0Dh   mm-001-xxx  DirectPath   -                 ~        1, 2


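
Note 1 of Table 16 states that the mem8 operand of PREFETCH and PREFETCHW names an address inside the 64-byte line to be prefetched, so any address within a line selects that same line. As a rough illustration, the C sketch below computes the line that an address falls in and how many lines a buffer spans. The macro and function names are hypothetical and are not taken from this guide.

```c
#include <stddef.h>
#include <stdint.h>

#define PREFETCH_LINE_BYTES 64u  /* line size stated in Note 1 */

/* Start address of the 64-byte line that contains addr. */
static uintptr_t prefetch_line_base(uintptr_t addr)
{
    return addr & ~(uintptr_t)(PREFETCH_LINE_BYTES - 1);
}

/* Number of distinct 64-byte lines touched by [addr, addr + len). */
static size_t prefetch_lines_spanned(uintptr_t addr, size_t len)
{
    if (len == 0)
        return 0;
    uintptr_t first = prefetch_line_base(addr);
    uintptr_t last  = prefetch_line_base(addr + len - 1);
    return (size_t)((last - first) / PREFETCH_LINE_BYTES) + 1;
}
```

A two-byte object that starts at the last byte of one line spans two lines, so covering it would take two prefetches, one per line.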
C.6 3DNow!™ Technology Extensions


Table 17. 3DNow!™ Technology Extensions

                         Encoding
Syntax                   Prefix byte(s)  imm8  ModRM byte  Decode type  FPU pipe(s)       Latency
PF2IW mmreg1, mmreg2     0Fh, 0Fh        1Ch   11-xxx-xxx  DirectPath   FADD              4
PF2IW mmreg, mem64       0Fh, 0Fh        1Ch   mm-xxx-xxx  DirectPath   FADD              6
PFNACC mmreg1, mmreg2    0Fh, 0Fh        8Ah   11-xxx-xxx  DirectPath   FADD              4
PFNACC mmreg, mem64      0Fh, 0Fh        8Ah   mm-xxx-xxx  DirectPath   FADD              6
PFPNACC mmreg1, mmreg2   0Fh, 0Fh        8Eh   11-xxx-xxx  DirectPath   FADD              4
PFPNACC mmreg, mem64     0Fh, 0Fh        8Eh   mm-xxx-xxx  DirectPath   FADD              6
PI2FW mmreg1, mmreg2     0Fh, 0Fh        0Ch   11-xxx-xxx  DirectPath   FADD              4
PI2FW mmreg, mem64       0Fh, 0Fh        0Ch   mm-xxx-xxx  DirectPath   FADD              6
PSWAPD mmreg1, mmreg2    0Fh, 0Fh        BBh   11-xxx-xxx  DirectPath   FADD/FMUL         2
PSWAPD mmreg, mem64      0Fh, 0Fh        BBh   mm-xxx-xxx  DirectPath   FADD/FMUL         4
C.7 SSE Instructions


Table 18. SSE Instructions

                              Encoding
Syntax                        Prefix byte  First byte  2nd byte  ModRM byte  Decode type  FPU pipe(s)       Latency  Note
ADDPS xmmreg1, xmmreg2        0Fh          58h                   11-xxx-xxx  Double       FADD              5        1
ADDPS xmmreg, mem128          0Fh          58h                   mm-xxx-xxx  Double       FADD              7        1
ADDSS xmmreg1, xmmreg2        F3h          0Fh         58h       11-xxx-xxx  DirectPath   FADD              4
ADDSS xmmreg, mem128          F3h          0Fh         58h       mm-xxx-xxx  DirectPath   FADD              6
ANDNPS xmmreg1, xmmreg2       0Fh          55h                   11-xxx-xxx  Double       FMUL              3        1
ANDNPS xmmreg, mem128         0Fh          55h                   mm-xxx-xxx  Double       FMUL              5        1
ANDPS xmmreg1, xmmreg2        0Fh          54h                   11-xxx-xxx  Double       FMUL              3        1
ANDPS xmmreg, mem128          0Fh          54h                   mm-xxx-xxx  Double       FMUL              5        1
CMPPS xmmreg1, xmmreg2, imm8  0Fh          C2h                   11-xxx-xxx  Double       FADD              3        1
CMPPS xmmreg, mem128, imm8    0Fh          C2h                   mm-xxx-xxx  Double       FADD              5        1
CMPSS xmmreg1, xmmreg2, imm8  F3h          0Fh         C2h       11-xxx-xxx  DirectPath   FADD              2
CMPSS xmmreg, mem32, imm8     F3h          0Fh         C2h       mm-xxx-xxx  DirectPath   FADD              4
COMISS xmmreg1, xmmreg2       0Fh          2Fh                   11-xxx-xxx  VectorPath                     4



Notes: 


1. The low half of the result is available one cycle earlier than listed. 

2. The second latency value indicates when the low half of the result becomes available. 

3. The high half of the result is available one cycle earlier than listed. 

4. The latency listed is the absolute minimum, while average latencies may be higher and are a function of internal 
pipeline conditions. 

5. For the PREFETCHNTA/T0/T1/T2 instructions, the mem8 value refers to an address in the 64-byte line to be 
prefetched. 

6. The 8-clock latency is only visible to younger stores that need to do an external write. The 2-clock latency is 
visible to the other stores and instructions. 

7. This is the execution latency for the instruction. The time to complete the external write depends on the memory 
speed and the hardware implementation. 
Table 18. SSE Instructions (Continued)

                              Encoding
Syntax                        Prefix byte  First byte  2nd byte  ModRM byte  Decode type  FPU pipe(s)       Latency  Note
COMISS xmmreg, mem32          0Fh          2Fh                   mm-xxx-xxx  VectorPath                     6
CVTPI2PS xmmreg, mmreg        0Fh          2Ah                   11-xxx-xxx  DirectPath                     4
CVTPI2PS xmmreg, mem64        0Fh          2Ah                   mm-xxx-xxx  DirectPath                     6
CVTPS2PI mmreg, xmmreg        0Fh          2Dh                   11-xxx-xxx  DirectPath                     4
CVTPS2PI mmreg, mem128        0Fh          2Dh                   mm-xxx-xxx  DirectPath                     6
CVTSI2SS xmmreg, reg32/64     F3h          0Fh         2Ah       11-xxx-xxx  VectorPath                     14
CVTSI2SS xmmreg, mem32/64     F3h          0Fh         2Ah       mm-xxx-xxx  Double                         9
CVTSS2SI reg32, xmmreg        F3h          0Fh         2Dh       11-xxx-xxx  Double                         9
CVTSS2SI reg32, mem32         F3h          0Fh         2Dh       mm-xxx-xxx  VectorPath                     10
CVTTPS2PI mmreg, xmmreg       0Fh          2Ch                   11-xxx-xxx  DirectPath                     4
CVTTPS2PI mmreg, mem128       0Fh          2Ch                   mm-xxx-xxx  DirectPath                     6
CVTTSS2SI reg32, xmmreg       F3h          0Fh         2Ch       11-xxx-xxx  Double                         9
CVTTSS2SI reg32, mem32        F3h          0Fh         2Ch       mm-xxx-xxx  VectorPath                     10
DIVPS xmmreg1, xmmreg2        0Fh          5Eh                   11-xxx-xxx  Double       FMUL              33



Table 18. SSE Instructions (Continued)

                              Encoding
Syntax                        Prefix byte  First byte  2nd byte  ModRM byte  Decode type  FPU pipe(s)       Latency  Note
DIVPS xmmreg, mem128          0Fh          5Eh                   mm-xxx-xxx  Double       FMUL              35
DIVSS xmmreg1, xmmreg2        F3h          0Fh         5Eh       11-xxx-xxx  DirectPath   FMUL              16
DIVSS xmmreg, mem32           F3h          0Fh         5Eh       mm-xxx-xxx  DirectPath   FMUL              18
LDMXCSR mem32                 0Fh          AEh                   mm-010-xxx  VectorPath                     13       4
MASKMOVQ mmreg1, mmreg2       0Fh          F7h                   11-xxx-xxx  VectorPath   FADD/FMUL/FSTORE  29
MAXPS xmmreg1, xmmreg2        0Fh          5Fh                   11-xxx-xxx  Double       FADD              3        1
MAXPS xmmreg, mem128          0Fh          5Fh                   mm-xxx-xxx  Double       FADD              5        1
MAXSS xmmreg1, xmmreg2        F3h          0Fh         5Fh       11-xxx-xxx  DirectPath   FADD              2
MAXSS xmmreg, mem32           F3h          0Fh         5Fh       mm-xxx-xxx  DirectPath   FADD              4
MINPS xmmreg1, xmmreg2        0Fh          5Dh                   11-xxx-xxx  Double       FADD              3        1
MINPS xmmreg, mem128          0Fh          5Dh                   mm-xxx-xxx  Double       FADD              5        1
MINSS xmmreg1, xmmreg2        F3h          0Fh         5Dh       11-xxx-xxx  DirectPath   FADD              2
MINSS xmmreg, mem32           F3h          0Fh         5Dh       mm-xxx-xxx  DirectPath   FADD              4
MOVAPS xmmreg1, xmmreg2       0Fh          28h                   11-xxx-xxx  Double                         2
MOVAPS xmmreg, mem128         0Fh          28h                   mm-xxx-xxx  Double                         2



Table 18. SSE Instructions (Continued)

                              Encoding
Syntax                        Prefix byte  First byte  2nd byte  ModRM byte  Decode type  FPU pipe(s)       Latency  Note
MOVAPS xmmreg1, xmmreg2       0Fh          29h                   11-xxx-xxx  Double                         2
MOVAPS mem128, xmmreg         0Fh          29h                   mm-xxx-xxx  Double                         3        1
MOVHLPS xmmreg1, xmmreg2      0Fh          12h                   11-xxx-xxx  DirectPath                     2
MOVHPS xmmreg, mem64          0Fh          16h                   mm-xxx-xxx  DirectPath                     2
MOVHPS mem64, xmmreg          0Fh          17h                   mm-xxx-xxx  DirectPath                     2
MOVLHPS xmmreg1, xmmreg2      0Fh          16h                   11-xxx-xxx  DirectPath                     2
MOVLPS xmmreg, mem64          0Fh          12h                   mm-xxx-xxx  DirectPath                     2
MOVLPS mem64, xmmreg          0Fh          13h                   mm-xxx-xxx  DirectPath                     2
MOVMSKPS reg32, xmmreg        0Fh          50h                   11-xxx-xxx  VectorPath                     3
MOVNTPS mem128, xmmreg        0Fh          2Bh                   mm-xxx-xxx  Double                         3        7
MOVNTQ mem64, mmreg           0Fh          E7h                   mm-xxx-xxx  DirectPath   FSTORE            2        7
MOVSS xmmreg1, xmmreg2        F3h          0Fh         10h       11-xxx-xxx  DirectPath                     2
MOVSS xmmreg, mem32           F3h          0Fh         10h       mm-xxx-xxx  Double                         3
MOVSS xmmreg1, xmmreg2        F3h          0Fh         11h       11-xxx-xxx  DirectPath                     2



Table 18. SSE Instructions (Continued) 


| Syntax | Prefix byte | First byte | 2nd byte | ModRM byte | Decode type | FPU pipe(s) | Latency | Note |
| MOVSS mem32, xmmreg | F3h | 0Fh | 11h | mm-xxx-xxx | DirectPath | | 2 | |
| MOVUPS xmmreg1, xmmreg2 | 0Fh | 10h | | 11-xxx-xxx | Double | | 2 | |
| MOVUPS xmmreg, mem128 | 0Fh | 10h | | mm-xxx-xxx | VectorPath | | 7 | |
| MOVUPS xmmreg1, xmmreg2 | 0Fh | 11h | | 11-xxx-xxx | Double | | 2 | |
| MOVUPS mem128, xmmreg | 0Fh | 11h | | mm-xxx-xxx | VectorPath | | 4 | |
| MULPS xmmreg1, xmmreg2 | 0Fh | 59h | | 11-xxx-xxx | Double | FMUL | 5 | 1 |
| MULPS xmmreg, mem128 | 0Fh | 59h | | mm-xxx-xxx | Double | FMUL | 7 | 1 |
| MULSS xmmreg1, xmmreg2 | F3h | 0Fh | 59h | 11-xxx-xxx | DirectPath | FMUL | 4 | |
| MULSS xmmreg, mem32 | F3h | 0Fh | 59h | mm-xxx-xxx | DirectPath | FMUL | 6 | |
| ORPS xmmreg1, xmmreg2 | 0Fh | 56h | | 11-xxx-xxx | Double | FMUL | 3 | 1 |
| ORPS xmmreg, mem128 | 0Fh | 56h | | mm-xxx-xxx | Double | FMUL | 5 | 1 |
| PAVGB mmreg1, mmreg2 | 0Fh | E0h | | 11-xxx-xxx | DirectPath | FADD/FMUL | 2 | |
| PAVGB mmreg, mem64 | 0Fh | E0h | | mm-xxx-xxx | DirectPath | FADD/FMUL | 4 | |
| PAVGW mmreg1, mmreg2 | 0Fh | E3h | | 11-xxx-xxx | DirectPath | FADD/FMUL | 2 | |
| PAVGW mmreg, mem64 | 0Fh | E3h | | mm-xxx-xxx | DirectPath | FADD/FMUL | 4 | |



Table 18. SSE Instructions (Continued) 


| Syntax | Prefix byte | First byte | 2nd byte | ModRM byte | Decode type | FPU pipe(s) | Latency | Note |
| PEXTRW reg32/64, mmreg, imm8 | 0Fh | C5h | | | Double | | 4 | 4 |
| PINSRW mmreg, reg32/64, imm8 | 0Fh | C4h | | | Double | - | 9 | 4 |
| PINSRW mmreg, mem16, imm8 | 0Fh | C4h | | | DirectPath | - | 4 | 4 |
| PMAXSW mmreg1, mmreg2 | 0Fh | EEh | | 11-xxx-xxx | DirectPath | FADD/FMUL | 2 | |
| PMAXSW mmreg, mem64 | 0Fh | EEh | | mm-xxx-xxx | DirectPath | FADD/FMUL | 4 | |
| PMAXUB mmreg1, mmreg2 | 0Fh | DEh | | 11-xxx-xxx | DirectPath | FADD/FMUL | 2 | |
| PMAXUB mmreg, mem64 | 0Fh | DEh | | mm-xxx-xxx | DirectPath | FADD/FMUL | 4 | |
| PMINSW mmreg1, mmreg2 | 0Fh | EAh | | 11-xxx-xxx | DirectPath | FADD/FMUL | 2 | |
| PMINSW mmreg, mem64 | 0Fh | EAh | | mm-xxx-xxx | DirectPath | FADD/FMUL | 4 | |
| PMINUB mmreg1, mmreg2 | 0Fh | DAh | | 11-xxx-xxx | DirectPath | FADD/FMUL | 2 | |
| PMINUB mmreg, mem64 | 0Fh | DAh | | mm-xxx-xxx | DirectPath | FADD/FMUL | 4 | |
| PMOVMSKB reg32/64, mmreg | 0Fh | D7h | | | VectorPath | - | 3 | 4 |
| PMULHUW mmreg1, mmreg2 | 0Fh | E4h | | 11-xxx-xxx | DirectPath | FMUL | 3 | |
| PMULHUW mmreg, mem64 | 0Fh | E4h | | mm-xxx-xxx | DirectPath | FMUL | 5 | |
| PREFETCHNTA mem8 | 0Fh | 18h | | mm-000-xxx | DirectPath | - | - | 5 |


Table 18. SSE Instructions (Continued) 


| Syntax | Prefix byte | First byte | 2nd byte | ModRM byte | Decode type | FPU pipe(s) | Latency | Note |
| PREFETCHT0 mem8 | 0Fh | 18h | | mm-001-xxx | DirectPath | - | - | 5 |
| PREFETCHT1 mem8 | 0Fh | 18h | | mm-010-xxx | DirectPath | - | - | 5 |
| PREFETCHT2 mem8 | 0Fh | 18h | | mm-011-xxx | DirectPath | - | - | 5 |
| PSADBW mmreg1, mmreg2 | 0Fh | F6h | | 11-xxx-xxx | DirectPath | FADD | 3 | |
| PSADBW mmreg, mem64 | 0Fh | F6h | | mm-xxx-xxx | DirectPath | FADD | 5 | |
| PSHUFW mmreg1, mmreg2, imm8 | 0Fh | 70h | | | DirectPath | FADD/FMUL | 2 | |
| PSHUFW mmreg, mem64, imm8 | 0Fh | 70h | | | DirectPath | FADD/FMUL | 4 | |
| RCPPS xmmreg1, xmmreg2 | 0Fh | 53h | | 11-xxx-xxx | Double | FMUL | 4 | 1 |
| RCPPS xmmreg, mem128 | 0Fh | 53h | | mm-xxx-xxx | Double | FMUL | 6 | 1 |
| RCPSS xmmreg1, xmmreg2 | F3h | 0Fh | 53h | 11-xxx-xxx | DirectPath | FMUL | 3 | |
| RCPSS xmmreg, mem32 | F3h | 0Fh | 53h | mm-xxx-xxx | DirectPath | FMUL | 5 | |
| RSQRTPS xmmreg1, xmmreg2 | 0Fh | 52h | | 11-xxx-xxx | Double | FMUL | 4 | 1 |
| RSQRTPS xmmreg, mem128 | 0Fh | 52h | | mm-xxx-xxx | Double | FMUL | 6 | 1 |
| RSQRTSS xmmreg1, xmmreg2 | F3h | 0Fh | 52h | 11-xxx-xxx | DirectPath | FMUL | 3 | |
| RSQRTSS xmmreg, mem32 | F3h | 0Fh | 52h | mm-xxx-xxx | DirectPath | FMUL | 5 | |
| SFENCE | 0Fh | AEh | | 11-111-000 | VectorPath | | 2/8 | 6 |


Table 18. SSE Instructions (Continued) 


| Syntax | Prefix byte | First byte | 2nd byte | ModRM byte | Decode type | FPU pipe(s) | Latency | Note |
| SHUFPS xmmreg1, xmmreg2, imm8 | 0Fh | C6h | | 11-xxx-xxx | VectorPath | FMUL | 4 | 1 |
| SHUFPS xmmreg, mem128, imm8 | 0Fh | C6h | | mm-xxx-xxx | VectorPath | FMUL | 6 | 2 |
| SQRTPS xmmreg1, xmmreg2 | 0Fh | 51h | | 11-xxx-xxx | Double | FMUL | 39 | |
| SQRTPS xmmreg, mem128 | 0Fh | 51h | | mm-xxx-xxx | Double | FMUL | 41 | |
| SQRTSS xmmreg1, xmmreg2 | F3h | 0Fh | 51h | 11-xxx-xxx | DirectPath | FMUL | 19 | |
| SQRTSS xmmreg, mem32 | F3h | 0Fh | 51h | mm-xxx-xxx | DirectPath | FMUL | 21 | |
| STMXCSR mem32 | 0Fh | AEh | | mm-011-xxx | VectorPath | | 11 | 4 |
| SUBPS xmmreg1, xmmreg2 | 0Fh | 5Ch | | 11-xxx-xxx | Double | FADD | 5 | 1 |
| SUBPS xmmreg, mem128 | 0Fh | 5Ch | | mm-xxx-xxx | Double | FADD | 7 | 1 |
| SUBSS xmmreg1, xmmreg2 | F3h | 0Fh | 5Ch | 11-xxx-xxx | DirectPath | FADD | 4 | |
| SUBSS xmmreg, mem32 | F3h | 0Fh | 5Ch | mm-xxx-xxx | DirectPath | FADD | 6 | |
| UCOMISS xmmreg1, xmmreg2 | 0Fh | 2Eh | | 11-xxx-xxx | VectorPath | | 4 | |
| UCOMISS xmmreg, mem32 | 0Fh | 2Eh | | mm-xxx-xxx | VectorPath | | 6 | |
| UNPCKHPS xmmreg1, xmmreg2 | 0Fh | 15h | | 11-xxx-xxx | Double | FMUL | 3 | 1 |
| UNPCKHPS xmmreg, mem128 | 0Fh | 15h | | mm-xxx-xxx | Double | FMUL | 5 | 1 |


Table 18. SSE Instructions (Continued) 


| Syntax | Prefix byte | First byte | 2nd byte | ModRM byte | Decode type | FPU pipe(s) | Latency | Note |
| UNPCKLPS xmmreg1, xmmreg2 | 0Fh | 14h | | 11-xxx-xxx | Double | FMUL | 3 | 3 |
| UNPCKLPS xmmreg, mem128 | 0Fh | 14h | | mm-xxx-xxx | Double | FMUL | 5 | 3 |
| XORPS xmmreg1, xmmreg2 | 0Fh | 57h | | 11-xxx-xxx | Double | FMUL | 3 | 1 |
| XORPS xmmreg, mem128 | 0Fh | 57h | | mm-xxx-xxx | Double | FMUL | 5 | 1 |


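
Note 5 of Table 18 states that the mem8 operand of the PREFETCH instructions names an address inside the 64-byte line to be prefetched. The C sketch below is an illustration added here, not code from the guide: it computes which 64-byte line an address falls in and how many lines a buffer touches, which is the arithmetic needed when deciding how many prefetches a block of data requires. The helper names (cache_line_base, same_cache_line, lines_spanned) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 64u /* line size stated in Note 5 of Table 18 */

/* Address of the first byte of the 64-byte line that contains addr. */
static uintptr_t cache_line_base(uintptr_t addr)
{
    return addr & ~(uintptr_t)(CACHE_LINE_BYTES - 1u);
}

/* Nonzero when both addresses fall in the same 64-byte line, so one
   prefetch would cover both. */
static int same_cache_line(uintptr_t a, uintptr_t b)
{
    return cache_line_base(a) == cache_line_base(b);
}

/* Number of distinct 64-byte lines touched by the range [addr, addr + len). */
static size_t lines_spanned(uintptr_t addr, size_t len)
{
    uintptr_t first;
    uintptr_t last;

    assert(len > 0);
    first = cache_line_base(addr);
    last = cache_line_base(addr + len - 1);
    return (size_t)((last - first) / CACHE_LINE_BYTES) + 1;
}
```

For instance, cache_line_base(0x1234) returns 0x1200, and a 32-byte object starting at 0x1230 spans two lines, so prefetching it completely would take two prefetch instructions under this model.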
C.8 SSE2 Instructions 

Table 19. SSE2 Instructions 


| Syntax | Prefix byte | First byte | 2nd byte | ModRM byte | Decode type | FPU pipe(s) | Latency | Throughput | Note |
| ADDPD xmmreg1, xmmreg2 | 66h | 0Fh | 58h | 11-xxx-xxx | Double | FADD | 5 | 1/2 | |
| ADDPD xmmreg, mem128 | 66h | 0Fh | 58h | mm-xxx-xxx | Double | FADD | 7 | 1/2 | |
| ADDSD xmmreg1, xmmreg2 | F2h | 0Fh | 58h | 11-xxx-xxx | DirectPath | FADD | 4 | 1/1 | |
| ADDSD xmmreg, mem64 | F2h | 0Fh | 58h | mm-xxx-xxx | DirectPath | FADD | 6 | 1/1 | |
| ANDNPD xmmreg1, xmmreg2 | 66h | 0Fh | 55h | 11-xxx-xxx | Double | FMUL | 3 | 1/2 | |
| ANDNPD xmmreg, mem128 | 66h | 0Fh | 55h | mm-xxx-xxx | Double | FMUL | 5 | 1/2 | |
| ANDPD xmmreg1, xmmreg2 | 66h | 0Fh | 54h | 11-xxx-xxx | Double | FMUL | 3 | 1/2 | |
| ANDPD xmmreg, mem128 | 66h | 0Fh | 54h | mm-xxx-xxx | Double | FMUL | 5 | 1/2 | |
| CMPPD xmmreg1, xmmreg2, imm8 | 66h | 0Fh | C2h | 11-xxx-xxx | Double | FADD | 3 | 1/2 | |
| CMPPD xmmreg, mem128, imm8 | 66h | 0Fh | C2h | mm-xxx-xxx | Double | FADD | 5 | 1/2 | |
| CMPSD xmmreg1, xmmreg2, imm8 | F2h | 0Fh | C2h | 11-xxx-xxx | DirectPath | FADD | 2 | 1/1 | |
| CMPSD xmmreg, mem64, imm8 | F2h | 0Fh | C2h | mm-xxx-xxx | DirectPath | FADD | 4 | 1/1 | |
| COMISD xmmreg1, xmmreg2 | 66h | 0Fh | 2Fh | 11-xxx-xxx | VectorPath | FADD | 4 | 1 | |
| COMISD xmmreg, mem64 | 66h | 0Fh | 2Fh | mm-xxx-xxx | VectorPath | FADD | 5 | 1 | |
| CVTDQ2PD xmmreg1, xmmreg2 | F3h | 0Fh | E6h | 11-xxx-xxx | Double | FSTORE | 5 | 1/2 | |
| CVTDQ2PD xmmreg, mem64 | F3h | 0Fh | E6h | mm-xxx-xxx | Double | FSTORE | 7 | 1/2 | |



Notes: 

1. The low half of the result is available one cycle earlier than listed. 

2. This is the execution latency for the instruction. The time to complete the external write depends on the memory 
speed and the hardware implementation. 
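
Table 19 lists both a latency and a throughput for most instructions. The C sketch below is an added illustration, not part of the original guide, of how the two figures can be combined into a rough lower bound on cycle count. It assumes that a throughput entry of 1/N means one instruction can start every N cycles, and it ignores decode limits, memory stalls, and every other pipeline effect, so real code can only be slower than these estimates. The function names chain_cycles and independent_cycles are hypothetical.

```c
#include <assert.h>

/* Lower bound for n operations where each one consumes the previous
   result: every operation must wait out the full latency. */
static unsigned chain_cycles(unsigned n, unsigned latency)
{
    return n * latency;
}

/* Lower bound for n independent operations on one pipe: a new operation
   starts every issue_interval cycles (the N in a 1/N throughput entry),
   and the last result appears one latency after the last start. */
static unsigned independent_cycles(unsigned n, unsigned latency,
                                   unsigned issue_interval)
{
    assert(n > 0 && issue_interval > 0);
    return (n - 1) * issue_interval + latency;
}
```

Using register-form values from Table 19, four dependent ADDSD operations (latency 4) need at least chain_cycles(4, 4) = 16 cycles, while four independent ones (throughput 1/1) need at least independent_cycles(4, 4, 1) = 7 cycles. For DIVSD (latency 20, throughput 1/17) the same comparison gives 80 and 71 cycles, which shows why breaking dependency chains helps far less for divides than for adds.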
Table 19. SSE2 Instructions (Continued) 


| Syntax | Prefix byte | First byte | 2nd byte | ModRM byte | Decode type | FPU pipe(s) | Latency | Throughput | Note |
| CVTDQ2PS xmmreg1, xmmreg2 | 0Fh | 5Bh | | 11-xxx-xxx | Double | FSTORE | 5 | 1/2 | |
| CVTDQ2PS xmmreg, mem128 | 0Fh | 5Bh | | mm-xxx-xxx | Double | FSTORE | 7 | 1/2 | |
| CVTPD2DQ xmmreg1, xmmreg2 | F2h | 0Fh | E6h | 11-xxx-xxx | VectorPath | - | 8 | | |
| CVTPD2DQ xmmreg, mem128 | F2h | 0Fh | E6h | mm-xxx-xxx | VectorPath | - | 10 | | |
| CVTPD2PI mmreg, xmmreg | 66h | 0Fh | 2Dh | 11-xxx-xxx | VectorPath | - | 8 | 1/2 | |
| CVTPD2PI mmreg, mem128 | 66h | 0Fh | 2Dh | mm-xxx-xxx | VectorPath | - | 10 | 1/2 | |
| CVTPD2PS xmmreg1, xmmreg2 | 66h | 0Fh | 5Ah | 11-xxx-xxx | VectorPath | - | 8 | | |
| CVTPD2PS xmmreg, mem128 | 66h | 0Fh | 5Ah | mm-xxx-xxx | VectorPath | - | 10 | | |
| CVTPI2PD xmmreg, mmreg | 66h | 0Fh | 2Ah | 11-xxx-xxx | Double | FSTORE | 5 | 1/2 | |
| CVTPI2PD xmmreg, mem64 | 66h | 0Fh | 2Ah | mm-xxx-xxx | Double | FSTORE | 7 | 1/2 | |
| CVTPS2DQ xmmreg1, xmmreg2 | 66h | 0Fh | 5Bh | 11-xxx-xxx | Double | FSTORE | 5 | 1/2 | |
| CVTPS2DQ xmmreg, mem128 | 66h | 0Fh | 5Bh | mm-xxx-xxx | Double | FSTORE | 7 | 1/2 | |
| CVTPS2PD xmmreg1, xmmreg2 | 0Fh | 5Ah | | 11-xxx-xxx | Double | - | 3 | 1/2 | |
| CVTPS2PD xmmreg, mem64 | 0Fh | 5Ah | | mm-xxx-xxx | Double | - | 5 | 1/2 | |
| CVTSD2SI reg32/64, xmmreg | F2h | 0Fh | 2Dh | 11-xxx-xxx | Double | FSTORE | 9 | 1/1 | |
| CVTSD2SI reg32/64, mem64 | F2h | 0Fh | 2Dh | mm-xxx-xxx | VectorPath | FADD/FMUL/FSTORE | 10 | 1/1 | |
| CVTSD2SS xmmreg1, xmmreg2 | F2h | 0Fh | 5Ah | 11-xxx-xxx | VectorPath | FSTORE | 12 | | |




Table 19. SSE2 Instructions (Continued) 


| Syntax | Prefix byte | First byte | 2nd byte | ModRM byte | Decode type | FPU pipe(s) | Latency | Throughput | Note |
| CVTSD2SS xmmreg, mem64 | F2h | 0Fh | 5Ah | mm-xxx-xxx | Double | FSTORE | 9 | | |
| CVTSI2SD xmmreg, reg32/64 | F2h | 0Fh | 2Ah | 11-xxx-xxx | Double | FSTORE | 11 | 1/1 | |
| CVTSI2SD xmmreg, mem32/64 | F2h | 0Fh | 2Ah | mm-xxx-xxx | DirectPath | FSTORE | 6 | 1/1 | |
| CVTSS2SD xmmreg1, xmmreg2 | F3h | 0Fh | 5Ah | 11-xxx-xxx | DirectPath | FSTORE | 2 | 1/1 | |
| CVTSS2SD xmmreg, mem32 | F3h | 0Fh | 5Ah | mm-xxx-xxx | DirectPath | FSTORE | 4 | 1/1 | |
| CVTSS2SI reg32/64, xmmreg | F3h | 0Fh | 2Dh | 11-xxx-xxx | Double | FSTORE | 9 | | |
| CVTSS2SI reg32/64, mem32 | F3h | 0Fh | 2Dh | mm-xxx-xxx | VectorPath | - | 10 | | |
| CVTTPD2DQ xmmreg1, xmmreg2 | 66h | 0Fh | E6h | 11-xxx-xxx | VectorPath | - | 8 | | |
| CVTTPD2DQ xmmreg, mem128 | 66h | 0Fh | E6h | mm-xxx-xxx | VectorPath | - | 10 | | |
| CVTTPD2PI mmreg, xmmreg | 66h | 0Fh | 2Ch | 11-xxx-xxx | VectorPath | - | 8 | 1/2 | |
| CVTTPD2PI mmreg, mem128 | 66h | 0Fh | 2Ch | mm-xxx-xxx | VectorPath | - | 10 | 1/2 | |
| CVTTPS2DQ xmmreg1, xmmreg2 | F3h | 0Fh | 5Bh | 11-xxx-xxx | Double | FSTORE | 5 | 1/2 | |
| CVTTPS2DQ xmmreg, mem128 | F3h | 0Fh | 5Bh | mm-xxx-xxx | Double | FSTORE | 7 | 1/2 | |
| CVTTSD2SI reg32/64, xmmreg | F2h | 0Fh | 2Ch | 11-xxx-xxx | Double | FSTORE | 9 | 1/1 | |
| CVTTSD2SI reg32/64, mem64 | F2h | 0Fh | 2Ch | mm-xxx-xxx | VectorPath | FADD/FMUL/FSTORE | 10 | 1/1 | |
| CVTTSS2SI reg32/64, xmmreg | F3h | 0Fh | 2Ch | 11-xxx-xxx | Double | FSTORE | 9 | | |
| CVTTSS2SI reg32/64, mem32 | F3h | 0Fh | 2Ch | mm-xxx-xxx | VectorPath | - | 10 | | |




Table 19. SSE2 Instructions (Continued) 


| Syntax | Prefix byte | First byte | 2nd byte | ModRM byte | Decode type | FPU pipe(s) | Latency | Throughput | Note |
| DIVPD xmmreg1, xmmreg2 | 66h | 0Fh | 5Eh | 11-xxx-xxx | Double | FMUL | 37 | 1/34 | |
| DIVPD xmmreg, mem128 | 66h | 0Fh | 5Eh | mm-xxx-xxx | Double | FMUL | 39 | 1/34 | |
| DIVSD xmmreg1, xmmreg2 | F2h | 0Fh | 5Eh | 11-xxx-xxx | DirectPath | FMUL | 20 | 1/17 | |
| DIVSD xmmreg, mem64 | F2h | 0Fh | 5Eh | mm-xxx-xxx | DirectPath | FMUL | 22 | 1/17 | |
| MASKMOVDQU xmmreg1, xmmreg2 | 66h | 0Fh | F7h | 11-xxx-xxx | VectorPath | - | 43 | | |
| MAXPD xmmreg1, xmmreg2 | 66h | 0Fh | 5Fh | 11-xxx-xxx | Double | FADD | 3 | 1/2 | |
| MAXPD xmmreg, mem128 | 66h | 0Fh | 5Fh | mm-xxx-xxx | Double | FADD | 5 | 1/2 | |
| MAXSD xmmreg1, xmmreg2 | F2h | 0Fh | 5Fh | 11-xxx-xxx | DirectPath | FADD | 2 | 1/1 | |
| MAXSD xmmreg, mem64 | F2h | 0Fh | 5Fh | mm-xxx-xxx | DirectPath | FADD | 4 | 1/1 | |
| MINPD xmmreg1, xmmreg2 | 66h | 0Fh | 5Dh | 11-xxx-xxx | Double | FADD | 3 | 1/2 | |
| MINPD xmmreg, mem128 | 66h | 0Fh | 5Dh | mm-xxx-xxx | Double | FADD | 5 | 1/2 | |
| MINSD xmmreg1, xmmreg2 | F2h | 0Fh | 5Dh | 11-xxx-xxx | DirectPath | FADD | 2 | 1/1 | |
| MINSD xmmreg, mem64 | F2h | 0Fh | 5Dh | mm-xxx-xxx | DirectPath | FADD | 4 | 1/1 | |
| MOVAPD xmmreg1, xmmreg2 | 66h | 0Fh | 28h | 11-xxx-xxx | Double | FADD/FMUL | 2 | | |
| MOVAPD xmmreg, mem128 | 66h | 0Fh | 28h | mm-xxx-xxx | Double | FADD/FMUL/FSTORE | 2 | | |
| MOVAPD xmmreg1, xmmreg2 | 66h | 0Fh | 29h | 11-xxx-xxx | Double | FADD/FMUL | 2 | | |
| MOVAPD mem128, xmmreg | 66h | 0Fh | 29h | mm-xxx-xxx | Double | FSTORE | 3 | | |




Table 19. SSE2 Instructions (Continued) 


| Syntax | Prefix byte | First byte | 2nd byte | ModRM byte | Decode type | FPU pipe(s) | Latency | Throughput | Note |
| MOVD xmmreg, reg32 | 66h | 0Fh | 6Eh | 11-xxx-xxx | VectorPath | - | 9 | | |
| MOVD xmmreg, mem32 | 66h | 0Fh | 6Eh | mm-xxx-xxx | Double | FADD/FMUL/FSTORE | 4 | | |
| MOVD reg32, xmmreg | 66h | 0Fh | 7Eh | 11-xxx-xxx | Double | FSTORE | 4 | | |
| MOVD mem32, xmmreg | 66h | 0Fh | 7Eh | mm-xxx-xxx | DirectPath | FSTORE | 2 | | |
| MOVD xmmreg, reg64 | 66h | 0Fh | 6Eh | 11-xxx-xxx | VectorPath | - | 9 | | |
| MOVD xmmreg, mem64 | 66h | 0Fh | 6Eh | mm-xxx-xxx | Double | FADD/FMUL/FSTORE | 4 | | |
| MOVD reg64, xmmreg | 66h | 0Fh | 7Eh | 11-xxx-xxx | Double | FSTORE | 4 | | |
| MOVD mem64, xmmreg | 66h | 0Fh | 7Eh | mm-xxx-xxx | DirectPath | FSTORE | 2 | | |
| MOVDQ2Q mmreg, xmmreg | F2h | 0Fh | D6h | 11-xxx-xxx | DirectPath | FADD/FMUL | 2 | | |
| MOVDQA xmmreg1, xmmreg2 | 66h | 0Fh | 6Fh | 11-xxx-xxx | Double | FADD/FMUL | 2 | | |
| MOVDQA xmmreg, mem128 | 66h | 0Fh | 6Fh | mm-xxx-xxx | Double | FADD/FMUL/FSTORE | 2 | | |
| MOVDQA xmmreg1, xmmreg2 | 66h | 0Fh | 7Fh | 11-xxx-xxx | Double | FADD/FMUL | 2 | | |
| MOVDQA mem128, xmmreg | 66h | 0Fh | 7Fh | mm-xxx-xxx | Double | FSTORE | 3 | | |
| MOVDQU xmmreg1, xmmreg2 | F3h | 0Fh | 6Fh | 11-xxx-xxx | Double | FADD/FMUL | 2 | | |
| MOVDQU xmmreg, mem128 | F3h | 0Fh | 6Fh | mm-xxx-xxx | VectorPath | - | 7 | | |
| MOVDQU xmmreg1, xmmreg2 | F3h | 0Fh | 7Fh | 11-xxx-xxx | Double | FADD/FMUL | 2 | | |
| MOVDQU mem128, xmmreg | F3h | 0Fh | 7Fh | mm-xxx-xxx | VectorPath | FSTORE | 4 | | |
| MOVHPD xmmreg, mem64 | 66h | 0Fh | 16h | mm-xxx-xxx | DirectPath | FADD/FMUL/FSTORE | 2 | | |




Table 19. SSE2 Instructions (Continued) 


| Syntax | Prefix byte | First byte | 2nd byte | ModRM byte | Decode type | FPU pipe(s) | Latency | Throughput | Note |
| MOVHPD mem64, xmmreg | 66h | 0Fh | 17h | mm-xxx-xxx | DirectPath | FSTORE | 2 | | |
| MOVLPD xmmreg, mem64 | 66h | 0Fh | 12h | mm-xxx-xxx | DirectPath | FADD/FMUL/FSTORE | 2 | | |
| MOVLPD mem64, xmmreg | 66h | 0Fh | 13h | mm-xxx-xxx | DirectPath | FSTORE | 2 | | |
| MOVMSKPD reg32/64, xmmreg | 66h | 0Fh | 50h | 11-xxx-xxx | VectorPath | FADD | 3 | 1/1 | |
| MOVNTDQ mem128, xmmreg | 66h | 0Fh | E7h | mm-xxx-xxx | Double | FSTORE | 3 | | 2 |
| MOVNTI mem32/64, reg32/64 | | 0Fh | C3h | mm-xxx-xxx | DirectPath | FSTORE | - | | |
| MOVNTPD mem128, xmmreg | 66h | 0Fh | 2Bh | mm-xxx-xxx | Double | FSTORE | 3 | | 2 |
| MOVQ xmmreg1, xmmreg2 | F3h | 0Fh | 7Eh | 11-xxx-xxx | Double | FADD/FMUL | 2 | | |
| MOVQ xmmreg, mem64 | F3h | 0Fh | 7Eh | mm-xxx-xxx | Double | FADD/FMUL/FSTORE | 4 | | |
| MOVQ xmmreg1, xmmreg2 | 66h | 0Fh | D6h | 11-xxx-xxx | Double | FADD/FMUL | 2 | | |
| MOVQ mem64, xmmreg | 66h | 0Fh | D6h | mm-xxx-xxx | DirectPath | FSTORE | 4 | | |
| MOVQ2DQ xmmreg, mmreg | F3h | 0Fh | D6h | 11-xxx-xxx | Double | FADD/FMUL | 2 | | |
| MOVSD xmmreg1, xmmreg2 | F2h | 0Fh | 10h | 11-xxx-xxx | DirectPath | FADD/FMUL | 2 | | |
| MOVSD xmmreg, mem64 | F2h | 0Fh | 10h | mm-xxx-xxx | Double | FADD/FMUL/FSTORE | 2 | | |
| MOVSD xmmreg1, xmmreg2 | F2h | 0Fh | 11h | 11-xxx-xxx | DirectPath | FADD/FMUL | 2 | | |
| MOVSD mem64, xmmreg | F2h | 0Fh | 11h | mm-xxx-xxx | DirectPath | FSTORE | 2 | | |




Table 19. SSE2 Instructions (Continued) 


| Syntax | Prefix byte | First byte | 2nd byte | ModRM byte | Decode type | FPU pipe(s) | Latency | Throughput | Note |
| MOVUPD xmmreg1, xmmreg2 | 66h | 0Fh | 10h | | Double | FADD/FMUL | 2 | | |
| MOVUPD xmmreg, mem128 | 66h | 0Fh | 10h | | VectorPath | FADD/FMUL/FSTORE | 7 | | |
| MOVUPD xmmreg1, xmmreg2 | 66h | 0Fh | 11h | | Double | FADD/FMUL | 2 | | |
| MOVUPD mem128, xmmreg | 66h | 0Fh | 11h | | VectorPath | FSTORE | 4 | | |
| MULPD xmmreg1, xmmreg2 | 66h | 0Fh | 59h | | Double | FMUL | 5 | 1/2 | |
| MULPD xmmreg, mem128 | 66h | 0Fh | 59h | | Double | FMUL | 7 | 1/2 | |
| MULSD xmmreg1, xmmreg2 | F2h | 0Fh | 59h | | DirectPath | FMUL | 4 | 1/1 | |
| MULSD xmmreg, mem64 | F2h | 0Fh | 59h | | DirectPath | FMUL | 6 | 1/1 | |
| ORPD xmmreg1, xmmreg2 | 66h | 0Fh | 56h | | Double | FMUL | 3 | 1/2 | |
| ORPD xmmreg, mem128 | 66h | 0Fh | 56h | | Double | FMUL | 5 | 1/2 | |
| PACKSSDW xmmreg1, xmmreg2 | 66h | 0Fh | 6Bh | | VectorPath | - | 4 | | |
| PACKSSDW xmmreg, mem128 | 66h | 0Fh | 6Bh | | VectorPath | - | 6 | | |
| PACKSSWB xmmreg1, xmmreg2 | 66h | 0Fh | 63h | | VectorPath | - | 4 | | |
| PACKSSWB xmmreg, mem128 | 66h | 0Fh | 63h | | VectorPath | - | 6 | | |
| PACKUSWB xmmreg1, xmmreg2 | 66h | 0Fh | 67h | | VectorPath | - | 4 | | |
| PACKUSWB xmmreg, mem128 | 66h | 0Fh | 67h | | VectorPath | - | 6 | | |
| PADDB xmmreg1, xmmreg2 | 66h | 0Fh | FCh | | Double | FADD/FMUL | 2 | 1/1 | |



Table 19. SSE2 Instructions (Continued) 


| Syntax | Prefix byte | First byte | 2nd byte | ModRM byte | Decode type | FPU pipe(s) | Latency | Throughput | Note |
| PADDB xmmreg, mem128 | 66h | 0Fh | FCh | | Double | FADD/FMUL | 4 | 1/1 | |
| PADDD xmmreg1, xmmreg2 | 66h | 0Fh | FEh | | Double | FADD/FMUL | 2 | 1/1 | |
| PADDD xmmreg, mem128 | 66h | 0Fh | FEh | | Double | FADD/FMUL | 4 | 1/1 | |
| PADDQ mmreg1, mmreg2 | 0Fh | D4h | | | DirectPath | FADD/FMUL | 2 | 1/1 | |
| PADDQ mmreg, mem64 | 0Fh | D4h | | | DirectPath | FADD/FMUL | 4 | 1/1 | |
| PADDQ xmmreg1, xmmreg2 | 66h | 0Fh | D4h | | Double | FADD/FMUL | 2 | 1/1 | |
| PADDQ xmmreg, mem128 | 66h | 0Fh | D4h | | Double | FADD/FMUL | 4 | 1/1 | |
| PADDSB xmmreg1, xmmreg2 | 66h | 0Fh | ECh | | Double | FADD/FMUL | 2 | 1/1 | |
| PADDSB xmmreg, mem128 | 66h | 0Fh | ECh | | Double | FADD/FMUL | 4 | 1/1 | |
| PADDSW xmmreg1, xmmreg2 | 66h | 0Fh | EDh | | Double | FADD/FMUL | 2 | 1/1 | |
| PADDSW xmmreg, mem128 | 66h | 0Fh | EDh | | Double | FADD/FMUL | 4 | 1/1 | |
| PADDUSB xmmreg1, xmmreg2 | 66h | 0Fh | DCh | | Double | FADD/FMUL | 2 | 1/1 | |
| PADDUSB xmmreg, mem128 | 66h | 0Fh | DCh | | Double | FADD/FMUL | 4 | 1/1 | |
| PADDUSW xmmreg1, xmmreg2 | 66h | 0Fh | DDh | | Double | FADD/FMUL | 2 | 1/1 | |
| PADDUSW xmmreg, mem128 | 66h | 0Fh | DDh | | Double | FADD/FMUL | 4 | 1/1 | |
| PADDW xmmreg1, xmmreg2 | 66h | 0Fh | FDh | | Double | FADD/FMUL | 2 | 1/1 | |
| PADDW xmmreg, mem128 | 66h | 0Fh | FDh | | Double | FADD/FMUL | 4 | 1/1 | |



Table 19. SSE2 Instructions (Continued) 


| Syntax | Prefix byte | First byte | 2nd byte | ModRM byte | Decode type | FPU pipe(s) | Latency | Throughput | Note |
| PAND xmmreg1, xmmreg2 | 66h | 0Fh | DBh | | Double | FADD/FMUL | 2 | 1/1 | |
| PAND xmmreg, mem128 | 66h | 0Fh | DBh | | Double | FADD/FMUL | 4 | 1/1 | |
| PANDN xmmreg1, xmmreg2 | 66h | 0Fh | DFh | | Double | FADD/FMUL | 2 | 1/1 | |
| PANDN xmmreg, mem128 | 66h | 0Fh | DFh | | Double | FADD/FMUL | 4 | 1/1 | |
| PAVGB xmmreg1, xmmreg2 | 66h | 0Fh | E0h | | Double | FADD/FMUL | 2 | 1/1 | |
| PAVGB xmmreg, mem128 | 66h | 0Fh | E0h | | Double | FADD/FMUL | 4 | 1/1 | |
| PAVGW xmmreg1, xmmreg2 | 66h | 0Fh | E3h | | Double | FADD/FMUL | 2 | 1/1 | |
| PAVGW xmmreg, mem128 | 66h | 0Fh | E3h | | Double | FADD/FMUL | 4 | 1/1 | |
| PCMPEQB xmmreg1, xmmreg2 | 66h | 0Fh | 74h | | Double | FADD/FMUL | 2 | 1/1 | |
| PCMPEQB xmmreg, mem128 | 66h | 0Fh | 74h | | Double | FADD/FMUL | 4 | 1/1 | |
| PCMPEQD xmmreg1, xmmreg2 | 66h | 0Fh | 76h | | Double | FADD/FMUL | 2 | 1/1 | |
| PCMPEQD xmmreg, mem128 | 66h | 0Fh | 76h | | Double | FADD/FMUL | 4 | 1/1 | |
| PCMPEQW xmmreg1, xmmreg2 | 66h | 0Fh | 75h | | Double | FADD/FMUL | 2 | 1/1 | |
| PCMPEQW xmmreg, mem128 | 66h | 0Fh | 75h | | Double | FADD/FMUL | 4 | 1/1 | |
| PCMPGTB xmmreg1, xmmreg2 | 66h | 0Fh | 64h | | Double | FADD/FMUL | 2 | 1/1 | |
| PCMPGTB xmmreg, mem128 | 66h | 0Fh | 64h | | Double | FADD/FMUL | 4 | 1/1 | |
| PCMPGTD xmmreg1, xmmreg2 | 66h | 0Fh | 66h | | Double | FADD/FMUL | 2 | 1/1 | |



Table 19. SSE2 Instructions (Continued) 


| Syntax | Prefix byte | First byte | 2nd byte | ModRM byte | Decode type | FPU pipe(s) | Latency | Throughput | Note |
| PCMPGTD xmmreg, mem128 | 66h | 0Fh | 66h | | Double | FADD/FMUL | 4 | 1/1 | |
| PCMPGTW xmmreg1, xmmreg2 | 66h | 0Fh | 65h | | Double | FADD/FMUL | 2 | 1/1 | |
| PCMPGTW xmmreg, mem128 | 66h | 0Fh | 65h | | Double | FADD/FMUL | 4 | 1/1 | |
| PEXTRW reg32/64, xmmreg, imm8 | 66h | 0Fh | C5h | | Double | FSTORE | 4 | 1/1 | |
| PINSRW xmmreg, reg32/64, imm8 | 66h | 0Fh | C4h | | VectorPath | FADD/FMUL | 10 | 1/1 | |
| PINSRW xmmreg, mem128, imm8 | 66h | 0Fh | C4h | | Double | FADD/FMUL | 4 | 1/1 | |
| PMADDWD xmmreg1, xmmreg2 | 66h | 0Fh | F5h | | Double | FMUL | 4 | 1/2 | |
| PMADDWD xmmreg, mem128 | 66h | 0Fh | F5h | | Double | FMUL | 6 | 1/2 | |
| PMAXSW xmmreg1, xmmreg2 | 66h | 0Fh | EEh | | Double | FADD/FMUL | 2 | 1/1 | |
| PMAXSW xmmreg, mem128 | 66h | 0Fh | EEh | | Double | FADD/FMUL | 4 | 1/1 | |
| PMAXUB xmmreg1, xmmreg2 | 66h | 0Fh | DEh | | Double | FADD/FMUL | 2 | 1/1 | |
| PMAXUB xmmreg, mem128 | 66h | 0Fh | DEh | | Double | FADD/FMUL | 4 | 1/1 | |
| PMINSW xmmreg1, xmmreg2 | 66h | 0Fh | EAh | | Double | FADD/FMUL | 2 | 1/1 | |
| PMINSW xmmreg, mem128 | 66h | 0Fh | EAh | | Double | FADD/FMUL | 4 | 1/1 | |
| PMINUB xmmreg1, xmmreg2 | 66h | 0Fh | DAh | | Double | FADD/FMUL | 2 | 1/1 | |
| PMINUB xmmreg, mem128 | 66h | 0Fh | DAh | | Double | FADD/FMUL | 4 | 1/1 | |
| PMOVMSKB reg32/64, xmmreg | 66h | 0Fh | D7h | | VectorPath | FADD/FMUL | 3 | 1/1 | |



Table 19. SSE2 Instructions (Continued) 


| Syntax | Prefix byte | First byte | 2nd byte | ModRM byte | Decode type | FPU pipe(s) | Latency | Throughput | Note |
| PMULHUW xmmreg1, xmmreg2 | 66h | 0Fh | E4h | | Double | FMUL | 4 | 1/2 | |
| PMULHUW xmmreg, mem128 | 66h | 0Fh | E4h | | Double | FMUL | 6 | 1/2 | |
| PMULHW xmmreg1, xmmreg2 | 66h | 0Fh | E5h | | Double | FMUL | 4 | 1/2 | |
| PMULHW xmmreg, mem128 | 66h | 0Fh | E5h | | Double | FMUL | 6 | 1/2 | |
| PMULLW xmmreg1, xmmreg2 | 66h | 0Fh | D5h | | Double | FMUL | 4 | 1/2 | |
| PMULLW xmmreg, mem128 | 66h | 0Fh | D5h | | Double | FMUL | 6 | 1/2 | |
| PMULUDQ mmreg1, mmreg2 | 0Fh | F4h | | | DirectPath | FMUL | 3 | 1/2 | |
| PMULUDQ mmreg, mem64 | 0Fh | F4h | | | DirectPath | FMUL | 5 | 1/2 | |
| PMULUDQ xmmreg1, xmmreg2 | 66h | 0Fh | F4h | | Double | FMUL | 4 | 1/2 | |
| PMULUDQ xmmreg, mem128 | 66h | 0Fh | F4h | | Double | FMUL | 6 | 1/2 | |
| POR xmmreg1, xmmreg2 | 66h | 0Fh | EBh | | Double | FADD/FMUL | 2 | 1/1 | |
| POR xmmreg, mem128 | 66h | 0Fh | EBh | | Double | FADD/FMUL | 4 | 1/1 | |
| PSADBW xmmreg1, xmmreg2 | 66h | 0Fh | F6h | | Double | FADD | 4 | 1/2 | |
| PSADBW xmmreg, mem128 | 66h | 0Fh | F6h | | Double | FADD | 6 | 1/2 | |
| PSHUFD xmmreg1, xmmreg2, imm8 | 66h | 0Fh | 70h | | VectorPath | - | 4 | | |
| PSHUFD xmmreg, mem128, imm8 | 66h | 0Fh | 70h | | VectorPath | - | 6 | | |
| PSHUFHW xmmreg1, xmmreg2, imm8 | F3h | 0Fh | 70h | | Double | FADD/FMUL | 2 | 1/1 | |



Notes: 

1. The low half of the result is available one cycle earlier than listed. 

2. This is the execution latency for the instruction. The time to complete the external write depends on the memory 
speed and the hardware implementation. 
Table 19. SSE2 Instructions (Continued)

Syntax | Prefix byte | First byte | 2nd byte | ModRM byte | Decode type | FPU pipe(s) | Latency | Throughput | Note
PSHUFHW xmmreg, mem128, imm8 | F3h | 0Fh | 70h | | Double | FADD/FMUL | 4 | 1/1 |
PSHUFLW xmmreg1, xmmreg2, imm8 | F2h | 0Fh | 70h | | Double | FADD/FMUL | 2 | 1/1 |
PSHUFLW xmmreg, mem128, imm8 | F2h | 0Fh | 70h | | Double | FADD/FMUL | 4 | 1/1 |
PSLLD xmmreg1, xmmreg2 | 66h | 0Fh | F2h | | Double | FADD/FMUL | 2 | 1/1 |
PSLLD xmmreg, mem128 | 66h | 0Fh | F2h | | Double | FADD/FMUL | 4 | 1/1 |
PSLLD xmmreg, imm8 | 66h | 0Fh | 72h | | Double | FADD/FMUL | 2 | 1/1 |
PSLLDQ xmmreg, imm8 | 66h | 0Fh | 73h | 11-111-xxx | Double | FADD/FMUL | 2 | 1/1 |
PSLLQ xmmreg1, xmmreg2 | 66h | 0Fh | F3h | | Double | FADD/FMUL | 2 | 1/1 |
PSLLQ xmmreg, mem128 | 66h | 0Fh | F3h | | Double | FADD/FMUL | 4 | 1/1 |
PSLLQ xmmreg, imm8 | 66h | 0Fh | 73h | 11-110-xxx | Double | FADD/FMUL | 2 | 1/1 |
PSLLW xmmreg1, xmmreg2 | 66h | 0Fh | F1h | | Double | FADD/FMUL | 2 | 1/1 |
PSLLW xmmreg, mem128 | 66h | 0Fh | F1h | | Double | FADD/FMUL | 4 | 1/1 |
PSLLW xmmreg, imm8 | 66h | 0Fh | 71h | 11-110-xxx | Double | FADD/FMUL | 2 | 1/1 |
PSRAD xmmreg1, xmmreg2 | 66h | 0Fh | E2h | | Double | FADD/FMUL | 2 | 1/1 |
PSRAD xmmreg, mem128 | 66h | 0Fh | E2h | | Double | FADD/FMUL | 4 | 1/1 |
PSRAD xmmreg, imm8 | 66h | 0Fh | 72h | 11-100-xxx | Double | FADD/FMUL | 2 | 1/1 |
PSRAW xmmreg1, xmmreg2 | 66h | 0Fh | E1h | | Double | FADD/FMUL | 2 | 1/1 |



Notes: 


1. The low half of the result is available one cycle earlier than listed. 

2. This is the execution latency for the instruction. The time to complete the external write depends on the memory 
speed and the hardware implementation. 
Table 19. SSE2 Instructions (Continued)

Syntax | Prefix byte | First byte | 2nd byte | ModRM byte | Decode type | FPU pipe(s) | Latency | Throughput | Note
PSRAW xmmreg, mem128 | 66h | 0Fh | E1h | | Double | FADD/FMUL | 4 | 1/1 |
PSRAW xmmreg, imm8 | 66h | 0Fh | 71h | 11-100-xxx | Double | FADD/FMUL | 2 | 1/1 |
PSRLD xmmreg1, xmmreg2 | 66h | 0Fh | D2h | | Double | FADD/FMUL | 2 | 1/1 |
PSRLD xmmreg, mem128 | 66h | 0Fh | D2h | | Double | FADD/FMUL | 4 | 1/1 |
PSRLD xmmreg, imm8 | 66h | 0Fh | 72h | 11-010-xxx | Double | FADD/FMUL | 2 | 1/1 |
PSRLDQ xmmreg, imm8 | 66h | 0Fh | 73h | 11-011-xxx | Double | FADD/FMUL | 2 | 1/1 |
PSRLQ xmmreg1, xmmreg2 | 66h | 0Fh | D3h | | Double | FADD/FMUL | 2 | 1/1 |
PSRLQ xmmreg, mem128 | 66h | 0Fh | D3h | | Double | FADD/FMUL | 4 | 1/1 |
PSRLQ xmmreg, imm8 | 66h | 0Fh | 73h | 11-010-xxx | Double | FADD/FMUL | 2 | 1/1 |
PSRLW xmmreg1, xmmreg2 | 66h | 0Fh | D1h | | Double | FADD/FMUL | 2 | 1/1 |
PSRLW xmmreg, mem128 | 66h | 0Fh | D1h | | Double | FADD/FMUL | 4 | 1/1 |
PSRLW xmmreg, imm8 | 66h | 0Fh | 71h | 11-010-xxx | Double | FADD/FMUL | 2 | 1/1 |
PSUBB xmmreg1, xmmreg2 | 66h | 0Fh | F8h | | Double | FADD/FMUL | 2 | 1/1 |
PSUBB xmmreg, mem128 | 66h | 0Fh | F8h | | Double | FADD/FMUL | 4 | 1/1 |
PSUBD xmmreg1, xmmreg2 | 66h | 0Fh | FAh | | Double | FADD/FMUL | 2 | 1/1 |
PSUBD xmmreg, mem128 | 66h | 0Fh | FAh | | Double | FADD/FMUL | 4 | 1/1 |
PSUBQ mmreg1, mmreg2 | | 0Fh | FBh | | DirectPath | FADD/FMUL | 2 | 1/1 |



Notes: 

1. The low half of the result is available one cycle earlier than listed. 

2. This is the execution latency for the instruction. The time to complete the external write depends on the memory 
speed and the hardware implementation. 
Table 19. SSE2 Instructions (Continued)

Syntax | Prefix byte | First byte | 2nd byte | ModRM byte | Decode type | FPU pipe(s) | Latency | Throughput | Note
PSUBQ mmreg, mem64 | | 0Fh | FBh | | DirectPath | FADD/FMUL | 5 | 1/1 |
PSUBQ xmmreg1, xmmreg2 | 66h | 0Fh | FBh | | Double | FADD/FMUL | 2 | 1/1 |
PSUBQ xmmreg, mem128 | 66h | 0Fh | FBh | | Double | FADD/FMUL | 4 | 1/1 |
PSUBSB xmmreg1, xmmreg2 | 66h | 0Fh | E8h | | Double | FADD/FMUL | 2 | 1/1 |
PSUBSB xmmreg, mem128 | 66h | 0Fh | E8h | | Double | FADD/FMUL | 4 | 1/1 |
PSUBSW xmmreg1, xmmreg2 | 66h | 0Fh | E9h | | Double | FADD/FMUL | 2 | 1/1 |
PSUBSW xmmreg, mem128 | 66h | 0Fh | E9h | | Double | FADD/FMUL | 4 | 1/1 |
PSUBUSB xmmreg1, xmmreg2 | 66h | 0Fh | D8h | | Double | FADD/FMUL | 2 | 1/1 |
PSUBUSB xmmreg, mem128 | 66h | 0Fh | D8h | | Double | FADD/FMUL | 4 | 1/1 |
PSUBUSW xmmreg1, xmmreg2 | 66h | 0Fh | D9h | | Double | FADD/FMUL | 2 | 1/1 |
PSUBUSW xmmreg, mem128 | 66h | 0Fh | D9h | | Double | FADD/FMUL | 4 | 1/1 |
PSUBW xmmreg1, xmmreg2 | 66h | 0Fh | F9h | | Double | FADD/FMUL | 2 | 1/1 |
PSUBW xmmreg, mem128 | 66h | 0Fh | F9h | | Double | FADD/FMUL | 4 | 1/1 |
PUNPCKHBW xmmreg1, xmmreg2 | 66h | 0Fh | 68h | | Double | FADD/FMUL | 2 | 1/1 |
PUNPCKHBW xmmreg, mem128 | 66h | 0Fh | 68h | | Double | FADD/FMUL | 4 | 1/1 |
PUNPCKHDQ xmmreg1, xmmreg2 | 66h | 0Fh | 6Ah | | Double | FADD/FMUL | 2 | 1/1 |
PUNPCKHDQ xmmreg, mem128 | 66h | 0Fh | 6Ah | | Double | FADD/FMUL | 4 | 1/1 |



Notes: 

1. The low half of the result is available one cycle earlier than listed. 

2. This is the execution latency for the instruction. The time to complete the external write depends on the memory 
speed and the hardware implementation. 
Table 19. SSE2 Instructions (Continued)

Syntax | Prefix byte | First byte | 2nd byte | ModRM byte | Decode type | FPU pipe(s) | Latency | Throughput | Note
PUNPCKHQDQ xmmreg1, xmmreg2 | 66h | 0Fh | 6Dh | | Double | FADD/FMUL | 2 | 1/1 |
PUNPCKHQDQ xmmreg, mem128 | 66h | 0Fh | 6Dh | | Double | FADD/FMUL | 4 | 1/1 |
PUNPCKHWD xmmreg1, xmmreg2 | 66h | 0Fh | 69h | | Double | FADD/FMUL | 2 | 1/1 |
PUNPCKHWD xmmreg, mem128 | 66h | 0Fh | 69h | | Double | FADD/FMUL | 4 | 1/1 |
PUNPCKLBW xmmreg1, xmmreg2 | 66h | 0Fh | 60h | | Double | FADD/FMUL | 2 | 1/1 |
PUNPCKLBW xmmreg, mem128 | 66h | 0Fh | 60h | | Double | FADD/FMUL | 4 | 1/1 |
PUNPCKLDQ xmmreg1, xmmreg2 | 66h | 0Fh | 62h | | Double | FADD/FMUL | 2 | 1/1 |
PUNPCKLDQ xmmreg, mem128 | 66h | 0Fh | 62h | | Double | FADD/FMUL | 4 | 1/1 |
PUNPCKLQDQ xmmreg1, xmmreg2 | 66h | 0Fh | 6Ch | | DirectPath | FADD/FMUL | 2 | 2/1 |
PUNPCKLQDQ xmmreg, mem128 | 66h | 0Fh | 6Ch | | DirectPath | FADD/FMUL/FSTORE | 4 | 2/1 |
PUNPCKLWD xmmreg1, xmmreg2 | 66h | 0Fh | 61h | | Double | FADD/FMUL | 2 | 1/1 |
PUNPCKLWD xmmreg, mem128 | 66h | 0Fh | 61h | | Double | FADD/FMUL | 4 | 1/1 |
PXOR xmmreg1, xmmreg2 | 66h | 0Fh | EFh | | Double | FADD/FMUL | 2 | 1/1 |
PXOR xmmreg, mem128 | 66h | 0Fh | EFh | | Double | FADD/FMUL | 4 | 1/1 |
SHUFPD xmmreg1, xmmreg2, imm8 | 66h | 0Fh | C6h | | VectorPath | ~ | 4 | |
SHUFPD xmmreg, mem128, imm8 | 66h | 0Fh | C6h | | VectorPath | ~ | 6 | |
SQRTPD xmmreg1, xmmreg2 | 66h | 0Fh | 51h | | Double | FMUL | 51 | 1/48 |



Notes: 

1. The low half of the result is available one cycle earlier than listed. 

2. This is the execution latency for the instruction. The time to complete the external write depends on the memory 
speed and the hardware implementation. 
Table 19. SSE2 Instructions (Continued)

Syntax | Prefix byte | First byte | 2nd byte | ModRM byte | Decode type | FPU pipe(s) | Latency | Throughput | Note
SQRTPD xmmreg, mem128 | 66h | 0Fh | 51h | | Double | FMUL | 53 | 1/48 |
SQRTSD xmmreg1, xmmreg2 | F2h | 0Fh | 51h | | DirectPath | FMUL | 27 | 1/24 |
SQRTSD xmmreg, mem64 | F2h | 0Fh | 51h | | DirectPath | FMUL | 29 | 1/24 |
SUBPD xmmreg1, xmmreg2 | 66h | 0Fh | 5Ch | | Double | FADD | 5 | 1/2 |
SUBPD xmmreg, mem128 | 66h | 0Fh | 5Ch | | Double | FADD | 7 | 1/2 |
SUBSD xmmreg1, xmmreg2 | F2h | 0Fh | 5Ch | | DirectPath | FADD | 4 | 1/1 |
SUBSD xmmreg, mem128 | F2h | 0Fh | 5Ch | | DirectPath | FADD | 6 | 1/1 |
UCOMISD xmmreg1, xmmreg2 | 66h | 0Fh | 2Eh | | VectorPath | FADD | 4 | 1/1 |
UCOMISD xmmreg, mem64 | 66h | 0Fh | 2Eh | | VectorPath | FADD | 5 | 1/1 |
UNPCKHPD xmmreg1, xmmreg2 | 66h | 0Fh | 15h | | Double | FADD/FMUL | 2 | 1/1 |
UNPCKHPD xmmreg, mem128 | 66h | 0Fh | 15h | | Double | FADD/FMUL/FSTORE | 4 | 1/1 |
UNPCKLPD xmmreg1, xmmreg2 | 66h | 0Fh | 14h | | DirectPath | FADD/FMUL | 2 | 2/1 |
UNPCKLPD xmmreg, mem128 | 66h | 0Fh | 14h | | DirectPath | FADD/FMUL/FSTORE | 4 | 2/1 |
XORPD xmmreg1, xmmreg2 | 66h | 0Fh | 57h | | Double | FMUL | 3 | 1/2 |
XORPD xmmreg, mem128 | 66h | 0Fh | 57h | | Double | FMUL | 5 | 1/2 |



Notes: 

1. The low half of the result is available one cycle earlier than listed. 

2. This is the execution latency for the instruction. The time to complete the external write depends on the memory 
speed and the hardware implementation. 
C.9 SSE3 Instructions

Table 20. SSE3 Instructions

Syntax | Prefix byte | First byte | 2nd byte | ModRM byte | Decode type | FPU pipe(s) | Latency | Throughput
ADDSUBPD xmmreg1, xmmreg2 | 66h | 0Fh | D0h | 11-xxx-xxx | Double | FADD | 5 | 1/2
ADDSUBPD xmmreg, mem128 | 66h | 0Fh | D0h | mm-xxx-xxx | Double | FADD | 7 | 1/2
ADDSUBPS xmmreg1, xmmreg2 | F2h | 0Fh | D0h | 11-xxx-xxx | Double | FADD | 5 | 1/2
ADDSUBPS xmmreg, mem128 | F2h | 0Fh | D0h | mm-xxx-xxx | Double | FADD | 7 | 1/2
FISTTP [mem16int] | | DFh | | mm-010-xxx | DirectPath | FSTORE | 4 |
FISTTP [mem32int] | | DBh | | mm-010-xxx | DirectPath | FSTORE | 4 |
FISTTP [mem64int] | | DDh | | mm-010-xxx | DirectPath | FSTORE | 4 |
HADDPD xmmreg1, xmmreg2 | 66h | 0Fh | 7Ch | 11-xxx-xxx | Double | FADD | 5 | 1/2
HADDPD xmmreg, mem128 | 66h | 0Fh | 7Ch | mm-xxx-xxx | VectorPath | FADD | 6 | 1/2
HADDPS xmmreg1, xmmreg2 | F2h | 0Fh | 7Ch | 11-xxx-xxx | Double | FADD | 5 | 1/2
HADDPS xmmreg1, mem128 | F2h | 0Fh | 7Ch | mm-xxx-xxx | VectorPath | FADD | 6 | 1/2
HSUBPD xmmreg1, xmmreg2 | 66h | 0Fh | 7Dh | 11-xxx-xxx | Double | FADD | 5 | 1/2
HSUBPD xmmreg1, mem128 | 66h | 0Fh | 7Dh | mm-xxx-xxx | VectorPath | FADD | 6 | 1/2
HSUBPS xmmreg1, xmmreg2 | F2h | 0Fh | 7Dh | 11-xxx-xxx | Double | FADD | 5 | 1/2
HSUBPS xmmreg, mem128 | F2h | 0Fh | 7Dh | mm-xxx-xxx | VectorPath | FADD | 6 | 1/2
LDDQU xmmreg, mem128 | F2h | 0Fh | F0h | mm-xxx-xxx | VectorPath | | 7 | 1/2
MOVDDUP xmmreg1, xmmreg2 | F2h | 0Fh | 12h | 11-xxx-xxx | Double | FMUL | 2 | 1/2
MOVDDUP xmmreg1, mem64 | F2h | 0Fh | 12h | mm-xxx-xxx | Double | FMUL | 4 | 1/2
MOVSHDUP xmmreg1, xmmreg2 | F3h | 0Fh | 16h | 11-xxx-xxx | Double | FMUL | 3 | 1/2
Table 20. SSE3 Instructions (Continued)

Syntax | Prefix byte | First byte | 2nd byte | ModRM byte | Decode type | FPU pipe(s) | Latency | Throughput
MOVSHDUP xmmreg, mem128 | F3h | 0Fh | 16h | mm-xxx-xxx | Double | FMUL | 5 | 1/2
MOVSLDUP xmmreg1, xmmreg2 | F3h | 0Fh | 12h | 11-xxx-xxx | Double | FMUL | 3 | 1/2
MOVSLDUP xmmreg1, mem128 | F3h | 0Fh | 12h | mm-xxx-xxx | Double | FMUL | 5 | 1/2
Appendix D AGP Considerations


Fast-write transactions are AGP data transfers that originate from processor-issued memory writes.
Frequently, the targets of fast writes are graphics accelerators, and the writes involve:

• Memory-mapped I/O registers (for example, the command FIFO).

• Graphics (2D/3D) engines.

• DVD (motion compensation, sub-picture, etc.) engine registers.

• Frame buffer (render buffers, textures, etc.).

This appendix covers the following topics: 


Topic                                                                      Page
Fast-Write Optimizations                                                   345
Fast-Write Optimizations for Graphics-Engine Programming                   346
Fast-Write Optimizations for Video-Memory Copies                           349
Memory Optimizations                                                       351
Memory Optimizations for Graphics-Engine Programming Using the DMA Model   352
Optimizations for Texture-Map Copies to AGP Memory                         353
Optimizations for Vertex-Geometry Copies to AGP Memory                     353


D.1 Fast-Write Optimizations 

Fast-write transfers use the PCI addressing semantics but transfer data using the AGP transfer rates 
(for example, 2x, 4x, or 8x) and AGP flow control between data blocks. The 
AMD-8151™ HyperTransport™ AGP 3.0 graphics tunnel converts processor memory writes 
(embedded in HyperTransport traffic) into fast-write transactions on the AGP bus. Fast writes offer an 
alternative to having the processor place data in memory, and then having the AGP accelerator read 
the data. 

Fast-write transfers are generated to the accelerator with a transfer start address, and then transfer data 
32 bits at a time (start address + 0, start address + 4, start address + 8, and so on) until the entire
block has been transferred. In this sense, the data is sequential (as it is in DMA). Following are the 
AGP bus characteristics: 

• The AGP bus clock is 66 MHz. 

• The AGP data width is 32 bits; at the 8x transfer rate, eight doublewords (32 bytes) can be 
transferred per AGP clock. 
The theoretical data bandwidths for fast writes at 2x, 4x, and 8x are approximately 528 Mbytes/s,
1.056 Gbytes/s, and 2.1 Gbytes/s, respectively. These numbers are theoretical in terms of sustained
bursts occurring on the AGP bus. In actuality, data bandwidth depends on the size of the data block
transferred from the processor; larger block transfers are better.

Real bandwidth will be lower than the theoretical bandwidth because the beginning of each fast-write
transaction requires sending a PCI-protocol start transaction cycle (for the address phase) at the 1x
transfer rate instead of the higher speeds (2x, 4x, or 8x).

Larger block transfers help hide the transaction-start overhead (smaller block transfers have lower 
bandwidth). For example, at the 8x data-transfer rate, 128 bytes of data can be transferred in four 
AGP clock cycles, but one initial clock cycle is required for the address phase. Five clock cycles are 
required to transfer 128 bytes of data; therefore, the overhead of the address phase (clock cycle 1) for 
128 bytes of data transferred is 20% (yielding a bandwidth of approximately 1.7 Gbytes/s). See 
Figure 10. 


[Timing diagram: CLK, AD, and C/BE signals over AGP clock cycles 1 through 9.]

Figure 10. AGP 8x Fast-Write Transaction

The overhead of the address phase for 64 bytes of data is 33% (yielding a bandwidth of approximately 
1400 Mbytes/s). For 32 bytes of data (or less), the bandwidth drops to approximately 1000 Mbytes/s. 
A key software optimization is to buffer as much processor write data as practical. 
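
As a rough check of the figures above, the following C sketch models a fast-write block as one address-phase clock followed by enough data clocks to move the block. The function name and the rounding to whole clocks are assumptions made for illustration; the 66-MHz clock and the data width of 4 bytes times the transfer rate are taken from the text.

```c
#include <assert.h>

/* AGP bus clock in MHz, as stated in the text above. */
#define AGP_CLOCK_MHZ 66.0

/*
 * Estimate fast-write bandwidth in Mbytes/s for a single block.
 * rate is the AGP transfer rate (2, 4, or 8); the 32-bit bus moves
 * 4 * rate bytes per AGP clock. One extra clock is charged for the
 * address phase that opens every fast-write transaction.
 */
double fast_write_mbytes_per_s(unsigned block_bytes, unsigned rate)
{
    assert(rate == 2u || rate == 4u || rate == 8u);
    assert(block_bytes > 0u);

    unsigned bytes_per_clock = 4u * rate;
    /* Round up: a partially filled data clock still costs a full clock. */
    unsigned data_clocks = (block_bytes + bytes_per_clock - 1u) / bytes_per_clock;
    unsigned total_clocks = data_clocks + 1u;   /* data clocks + address phase */

    return (double)block_bytes * AGP_CLOCK_MHZ / (double)total_clocks;
}
```

Under this model, a 128-byte block at 8x works out to about 1690 Mbytes/s, a 64-byte block to 1408 Mbytes/s, and a 32-byte block to 1056 Mbytes/s, in line with the approximate values quoted above.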

D.2 Fast-Write Optimizations for Graphics-Engine 
Programming 

Write-combining provides excellent AGP fast-write bandwidth when using the programmed I/O 
(PIO) model—not the DMA model—for programming 2-D and 3-D graphics engines. To help ensure 
that data is sent in optimal block sizes, you can “shadow” the engine’s render commands (that is, the 
registers needed for a render command) in cache-block-aligned data structures in system memory. 

Shadowing the structure in system memory (instead of writing the actual write-combining buffer in 
memory-mapped I/O space) ensures that the write buffer is not emptied prematurely by external 
events (such as an uncacheable read or hardware interrupt). Shadowing also ensures that writes to 
different cache lines in the structure do not flush (close) the write-combining buffer since the number 
of write-combining buffers that can be open at one time is processor-implementation dependent. 







25112 Rev. 3.06 September 2005 



Software Optimization Guide for AMD64 Processors 


On the AMD Athlon™ 64 and AMD Opteron™ processors, write-combining can be used, and 
software can take advantage of the fact that writes are sent out of the processor's write buffers in 
ascending order (and appear on HyperTransport that way), from low quadword to high quadword. 

Use the Memory Type Range Register (MTRR) mechanism in conjunction with the PAT MSR 
(model-specific register 277h) to enable write-combining as the memory type for the FIFO address 
space. 

To enable write-combining as the memory type for the FIFO address space, follow these steps: 

1. Change the PAT MSR entries that contain a type value of 00h (UC, uncacheable) to a type value of
07h (UC-minus).

2. Program an MTRR with the physical address and mask range of the command FIFO.

Note: MTRR registers mark addresses on page-granularity boundaries of 4 Kbytes, so the FIFO
address should begin on a 4-Kbyte-aligned address boundary.

For more information, see Chapter 7, “Memory System,” in volume 2 of the AMD64 Architecture 
Programmer’s Manual, order# 24593. 
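
The PAT change in step 1 can be sketched as a pure function on the 64-bit MSR value. This is only an illustration: the function names are hypothetical, the sketch assumes the architectural PAT layout of eight one-byte entries with the memory type in the low three bits of each, and actually reading or writing MSR 277h requires RDMSR/WRMSR at privilege level 0, which is not shown here.

```c
#include <assert.h>
#include <stdint.h>

#define PAT_TYPE_UC        0x00u   /* uncacheable */
#define PAT_TYPE_UC_MINUS  0x07u   /* UC-minus */
#define PAT_TYPE_MASK      0x07u   /* type field: low 3 bits of each entry */

/* Return a copy of the PAT value with every UC (00h) entry changed to UC-minus (07h). */
uint64_t pat_uc_to_uc_minus(uint64_t pat)
{
    for (unsigned entry = 0; entry < 8u; ++entry) {
        unsigned shift = entry * 8u;
        uint64_t type = (pat >> shift) & PAT_TYPE_MASK;
        if (type == PAT_TYPE_UC) {
            pat &= ~((uint64_t)PAT_TYPE_MASK << shift);
            pat |= (uint64_t)PAT_TYPE_UC_MINUS << shift;
        }
    }
    return pat;
}

/* MTRR ranges are 4-Kbyte granular, so the FIFO base must be 4-Kbyte aligned. */
int fifo_base_is_mtrr_aligned(uint64_t phys_addr)
{
    return (phys_addr & 0xFFFu) == 0;
}
```

A driver would apply the first function to the value read from the PAT MSR before writing it back, and would check the FIFO base with the second function before programming the MTRR in step 2.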

Many graphics engines have a front-end command FIFO that requires the render command to be 
issued first, followed by a variable number of doublewords, depending on the render command. 

Create a cache-aligned command structure in cacheable memory, map the rendering command into 
the lowest doubleword of the structure (which will be issued first), map the next data required in the 
command into the next structure element, and so on, until all the data “registers” for this command are 
included in the structure. An example is given in Figure 11. 


Doubleword 16 (3Fh)                          <- Top of cache line
...
Doubleword 2 (8h)       Parameter 2
Doubleword 1 (4h)       Parameter 1
Doubleword 0 (0h)       Render command 1

Figure 11. Cacheable-Memory Command Structure

When the command (or commands) have been filled in the shadowed structure, use a high-speed copy
routine like the one shown in Listing 31 on page 348 to copy the structure to the actual graphics
accelerator's write-combining FIFO address space. Locating the write-combining command FIFO at
a cache-aligned address is slightly better, since one HyperTransport link-size write occurs instead of
two.
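
The shadowed command structure described above might be declared in C along the following lines. This is a minimal sketch, not a definition taken from any driver: the type name, the helper function, and the choice of a single 64-byte line are assumptions for illustration.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>
#include <string.h>

#define CACHE_LINE_BYTES 64
#define DWORDS_PER_LINE  (CACHE_LINE_BYTES / sizeof(uint32_t))

/* One cache line of shadowed engine "registers"; dw[0] is issued first. */
typedef struct {
    alignas(CACHE_LINE_BYTES) uint32_t dw[DWORDS_PER_LINE];
} shadow_cmd_t;

/*
 * Fill the shadow: render command in doubleword 0, parameters in the
 * doublewords that follow. Returns the number of doublewords used.
 */
size_t shadow_fill(shadow_cmd_t *s, uint32_t render_cmd,
                   const uint32_t *params, size_t nparams)
{
    assert(nparams < DWORDS_PER_LINE);      /* command + params must fit */
    memset(s, 0, sizeof *s);                /* unused doublewords stay zero */
    s->dw[0] = render_cmd;
    memcpy(&s->dw[1], params, nparams * sizeof params[0]);
    return nparams + 1;
}
```

Because the structure lives in ordinary cacheable memory, filling it in any order does not disturb the write-combining buffer; only the final copy to the FIFO does.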
If there are any “empty” doublewords between the last parameter and the top of the cache line, use the
SFENCE instruction to flush the write-combining buffer. The data is issued in ascending order.
SFENCE is needed to flush the processor’s write-combining buffer on any partially filled buffer. In
general, use SFENCE when all parameters needed for rendering have been copied to the
memory-mapped I/O (MMIO) FIFO. This ensures that write data is not kept in the processor’s
write-combining buffer (which prevents the graphics engine from receiving an incomplete command
until the buffer is eventually flushed).

The AGP 3.0 specification specifies that accelerators must be able to buffer at least 128 bytes for the 
initial data block transferred. Try using 64-128 bytes as the optimal transfer size whenever possible 
(one to two processor cache lines). Map as many commands as will fit into this 64-128-byte structure. 

Listing 31. Sending Write-Combined Data to the Graphics-Engine Command FIFO 

/* Send commands to a graphics accelerator's 2-D engine. */
/* The shadowed structure contains 32 doublewords' worth of */
/* rendering commands and data parameters. */
/* Send out 128 (80h) bytes to the FIFO in WC MMIO space. */

/* First load a 64-bit pointer to the cached command structure. */

mov rdi, OFFSET ShadowRegs_Structure

/* We now have a pointer to the shadowed engine structure. */
/* Grab 16 bytes at a time. */

movdqa xmm0, [rdi]
movdqa xmm1, [rdi + 16]
movdqa xmm2, [rdi + 32]
movdqa xmm3, [rdi + 48]
movdqa xmm4, [rdi + 64]
movdqa xmm5, [rdi + 80]
movdqa xmm6, [rdi + 96]
movdqa xmm7, [rdi + 112]

/* Now get the linear pointer to the graphics engine mapped in */
/* WC address space. */

mov rax, QWORD PTR [Linear2Dengine_Ptr]

/* Now copy register data to the processor's WC buffer. */
/* It is slightly more optimal if the command FIFO */
/* is at a cache-line-aligned address. */
/* Write 16 bytes at a time. */

movdqa [rax], xmm0
movdqa [rax + 16], xmm1
movdqa [rax + 32], xmm2

/* The first WC buffer will be sent after the next write */
/* (assuming the FIFO is cache-line aligned) since we are crossing */
/* a cache-line boundary. */

movdqa [rax + 48], xmm3

/* Allocate and fill another WC buffer. */

movdqa [rax + 64], xmm4
movdqa [rax + 80], xmm5
movdqa [rax + 96], xmm6

/* The second WC buffer is forced after the next write. */
/* The linear ascending order between cache lines */
/* is maintained since the buffer is sent when filled. */

movdqa [rax + 112], xmm7
sfence

/* The SFENCE forces the write-combining buffer */
/* out of the processor and to the graphics chip. */
/* Set up the next drawing commands in the cached */
/* memory structure ShadowRegs_Structure. */
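
For code written in C rather than assembly, roughly the same sequence can be expressed with SSE2 compiler intrinsics. The sketch below is an approximation of Listing 31, not a replacement for it: the function name is hypothetical, both pointers are assumed to be 16-byte aligned, and a compiler is free to schedule the loads and stores differently from the hand-written listing, so a production driver that depends on the exact store order may still need assembly.

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2 loads/stores; also provides _mm_sfence */
#include <stdint.h>

/*
 * Copy 128 bytes (two cache lines) from the cached shadow structure to
 * the destination FIFO, then fence so no data lingers in a partially
 * filled write-combining buffer.
 */
void send_commands_128(void *fifo, const void *shadow)
{
    const __m128i *src = (const __m128i *)shadow;
    __m128i *dst = (__m128i *)fifo;
    __m128i r[8];

    assert(((uintptr_t)src & 15u) == 0 && ((uintptr_t)dst & 15u) == 0);

    for (int i = 0; i < 8; ++i)          /* read the whole shadow first */
        r[i] = _mm_load_si128(src + i);

    for (int i = 0; i < 8; ++i)          /* then write in ascending order */
        _mm_store_si128(dst + i, r[i]);

    _mm_sfence();                        /* flush the write-combining buffer */
}
```

The destination would normally be the write-combining MMIO mapping of the command FIFO; for experimentation, any 16-byte-aligned buffer in ordinary memory can stand in for it.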


D.3 Fast-Write Optimizations for Video-Memory 
Copies 

When performing block copies of an image to the graphics accelerator’s local memory, you can
preserve the contents of the L1 and L2 caches and reduce cache-line-replacement traffic to system
memory by using a nontemporal block prefetch on the image data using the PREFETCHNTA
instruction. This works well with images loaded into system memory through disk DMA because the
data can be kept out of the L2 cache and mostly out of the L1 data cache (when using
PREFETCHNTA). This is illustrated in Listing 32.

Note: On the AMD Athlon™ 64 and AMD Opteron™ processors, PREFETCHNTA uses one way of
the two-way set-associative L1 data cache. One way of the L1 data cache is 32 Kbytes, so
limit the block prefetch size to less than or equal to 32 Kbytes.

Listing 32. Writing Nontemporal Data to Video RAM 

/* Copy an image larger than 32 Kbytes into local memory, */
/* but limit the block prefetch so as not to exceed 32 Kbytes, */
/* which is the size of the nontemporal cache. */
/* First, block prefetch 16 Kbytes into the L1 data cache, then write */
/* it to the frame buffer. */
/* On AMD Athlon 64 and AMD Opteron processors, the PREFETCHNTA instruction */
/* must execute prior to subsequent instructions. */
/* Cache lines that are prefetched via PREFETCHNTA and later replaced are */
/* not evicted to the L2 cache or system memory. */

// Use half of the 32-Kbyte nontemporal cache for a block load.

#define HALFL1PREFETCHNTACACHESIZE 16384

mov rdi, QWORD PTR [image_source]
mov rcx, HALFL1PREFETCHNTACACHESIZE / 64

Block_PrefetchIntoL1:

prefetchnta [rdi]                ; Grab 64 bytes.
add rdi, 64                      ; Bump up to next cache line.
dec rcx
jnz Block_PrefetchIntoL1

LoadPtr_ToFrameBuffer:

/* Reload the image-source pointer (the prefetch loop advanced RDI). */
mov rdi, QWORD PTR [image_source]
mov rcx, HALFL1PREFETCHNTACACHESIZE / 128

/* Get linear pointer to local memory mapped in WC address space. */
mov rax, QWORD PTR [FBimage_Ptr]

/* Send out 128 bytes (yielding ~1.7 Gbytes/s of fast-write bandwidth) */
/* per block. RDI now has pointer back to image source. */
/* 16 Kbytes of image is in L1 nontemporal cache (way 0 of cache). */

Block_WriteToFrameBuffer:

movdqa xmm0, [rdi]
movdqa xmm1, [rdi+16]
movdqa xmm2, [rdi+32]
movdqa xmm3, [rdi+48]
movdqa xmm4, [rdi+64]
movdqa xmm5, [rdi+80]
movdqa xmm6, [rdi+96]
movdqa xmm7, [rdi+112]

/* Copy register data to WC buffer. */

movdqa [rax], xmm0
movdqa [rax+16], xmm1
movdqa [rax+32], xmm2

/* The first WC buffer is sent after next write since we are crossing */
/* a cache-line boundary. */

movdqa [rax+48], xmm3

/* Allocate and fill another WC buffer. */

movdqa [rax+64], xmm4
movdqa [rax+80], xmm5
movdqa [rax+96], xmm6
movdqa [rax+112], xmm7

add rax, 128                     ; Bump up by 2 cache lines
add rdi, 128                     ; for source and destination.
dec rcx
jnz Block_WriteToFrameBuffer

ChunkOfImageCopied:

/* Set up for next block in image (if necessary) */
/* until image is transferred. */
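
The same block-prefetch-then-copy pattern can be sketched in C with intrinsics, including the outer loop over successive 16-Kbyte blocks that Listing 32 leaves to the caller. The function name, the requirement that the size be a multiple of 128 bytes, and the trailing SFENCE are assumptions of this sketch rather than part of the listing, and the compiler may schedule the inner loop differently from the hand-written code.

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2 loads/stores, _mm_prefetch, _mm_sfence */
#include <stddef.h>
#include <stdint.h>

#define NTA_BLOCK_BYTES  16384u   /* half of one 32-Kbyte L1 way, as in Listing 32 */
#define CACHE_LINE_BYTES 64u

/* Copy an image block by block: prefetch a block with the NTA hint, then copy it. */
void copy_image_nta(void *dst, const void *src, size_t bytes)
{
    const char *s = (const char *)src;
    char *d = (char *)dst;

    assert(bytes % 128u == 0);   /* whole pairs of cache lines only */
    assert(((uintptr_t)s & 15u) == 0 && ((uintptr_t)d & 15u) == 0);

    while (bytes > 0) {
        size_t block = bytes < NTA_BLOCK_BYTES ? bytes : NTA_BLOCK_BYTES;

        /* Pass 1: touch every cache line of the block with PREFETCHNTA. */
        for (size_t off = 0; off < block; off += CACHE_LINE_BYTES)
            _mm_prefetch(s + off, _MM_HINT_NTA);

        /* Pass 2: copy the block 16 bytes at a time, in ascending order. */
        for (size_t off = 0; off < block; off += 16u)
            _mm_store_si128((__m128i *)(d + off),
                            _mm_load_si128((const __m128i *)(s + off)));

        s += block;
        d += block;
        bytes -= block;
    }
    _mm_sfence();   /* push out any partially filled write-combining buffer */
}
```

Keeping each block at 16 Kbytes stays well inside the 32-Kbyte limit given in the note above.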

D.4 Memory Optimizations 

AGP memory is system memory that is partitioned from the same memory that the operating system 
and applications use. The AGP card plugged into the AGP bus is always considered the master when 
performing AGP memory accesses since it reads and writes the system memory. The AGP card uses 
AGP memory for a variety of “surfaces,” including: 

• Texture maps 

• 3-D object geometry and vertex data streams 

• Command buffers for 2-D and 3-D graphics engines 

• Video-capture buffers 

• Frame buffer (cost-reduced implementations) 

The system memory used for AGP mastering is attached to the processor that has one of its 
HyperTransport links connected to an AGP tunnel device, such as the AMD-8151 HyperTransport 
AGP 3.0 graphics tunnel. AGP card requests (reads/writes) come into the processor through the 
HyperTransport link input and are arbitrated with processor requests for system memory in the 
system request queue (SRQ). From here, the AGP request address is passed into the processor’s 
address map and GART (graphics aperture remapping table), where the AGP physical address is 
translated into a physical DRAM page address, which can then be presented to the processor’s 
memory controller. Therefore, host processor to system memory throughput directly affects AGP 
memory bandwidth and throughput, as the two compete for SRQ entries and memory bandwidth. 
Figure 12 shows the command flow from the HyperTransport links to the SRQ.
[Block diagram: CPU 0 and the outputs of HyperTransport links 0, 1, and 2.]

Figure 12. Northbridge Command Flow
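
The GART translation step described above can be modeled in a few lines of C. This is a conceptual sketch only: the structure, the field names, the flat array of page-frame addresses, and the 4-Kbyte page size are assumptions made for illustration and do not reflect the actual GART entry format used by the hardware.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GART_PAGE_SHIFT 12u        /* assumed 4-Kbyte pages */
#define GART_PAGE_MASK  0xFFFu

/* Hypothetical software model of an aperture and its remapping table. */
typedef struct {
    uint64_t        aperture_base;  /* first address of the AGP aperture */
    size_t          num_pages;      /* number of pages in the aperture */
    const uint64_t *page_frames;    /* page_frames[i]: DRAM base of aperture page i */
} gart_model_t;

/* Translate an aperture address to a DRAM address; returns 0 on success, -1 if out of range. */
int gart_translate(const gart_model_t *g, uint64_t agp_addr, uint64_t *dram_addr)
{
    if (agp_addr < g->aperture_base)
        return -1;

    uint64_t offset = agp_addr - g->aperture_base;
    uint64_t page = offset >> GART_PAGE_SHIFT;
    if (page >= g->num_pages)
        return -1;

    /* Page base comes from the table; the offset within the page is unchanged. */
    *dram_addr = g->page_frames[page] | (offset & GART_PAGE_MASK);
    return 0;
}
```

The lookup illustrates why every AGP access through the aperture passes through the processor: the request is arbitrated in the SRQ and remapped before it reaches the memory controller.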


D.5 Memory Optimizations for Graphics-Engine 
Programming Using the DMA Model 

Historically (that is, with AGP 1.0 and AGP 2.0), AGP memory used for command DMA buffers was 
accessed by the processor through the AGP aperture space (this feature is referred to as host 
translation). This address space was mapped as write-combining due to the fact that the processor’s 
caches were not snooped by an AGP master (that is, coherency was not enforced for AGP memory). 
Write-combining offered the best bandwidth in this situation because write-combining buffers could 
be sent to system memory as full write-combining buffers. However, system memory still needed to 
be written, which used memory bandwidth. 

On current systems, however, coherency between an AGP master (making accesses through the AGP
aperture) and the processor caches is maintained due to the HyperTransport protocol and the MOESI 
(modified, owner, exclusive, shared, invalid) caching policy. Coherency support between an AGP 
master and the processor caches is enabled through a bit in the GART entry (Gart_entry.coh). The 
AGP miniport driver sets this bit as it maps entries in the GART. The video graphics miniport driver 
can verify this feature in the AGP 3.0-compliant register (AGPSTAT.ita_entry.coh), which is found in 
the AGP bridge device. 

Note: Coherency support is implemented by hardware in AMD Athlon 64 and AMD Opteron
processors, and is not specific to the AGP tunnel device, even though the support is indicated
in the tunnel's AGP 3.0-compliant register (AGPSTAT.ita_entry.coh).

Therefore, a key optimization for the DMA model on AMD Athlon 64 and AMD Opteron processors
is that the AGP master may read the data from the processor caches faster than reading data from the
DDR memory, since the processor caches operate at higher clock frequencies. As processor clock
frequencies increase, so will the ratio of operating frequencies between processor caches and DDR
memory. The processor-to-write-back cache bandwidth is also higher than processor-to-AGP-aperture
bandwidth (write-combining memory type), since the DDR writes are avoided (as well as GART
translation latencies).

It may be possible to prevent pollution of the L1 data and L2 caches from DMA data by using the
nontemporal PREFETCHNTA instruction on the DMA buffer and limiting prefetching of the DMA
buffer to less than 32 Kbytes (PREFETCHNTA uses only one way of the L1 data cache).

Use PREFETCHNTA on the linear address to the DMA buffer, and not the AGP aperture address, 
before reading or writing the DMA buffer. 

Another key optimization for the DMA model on AMD Athlon 64 and AMD Opteron systems is that 
coherency is maintained between processor caches and an AGP master making accesses outside of 
the AGP aperture. 

This is a key AGP enhancement that is required of AGP 3.0 target (host platform) systems. 

In effect, this means that a device driver can create a DMA buffer in normal write-back memory and
then pass the physical DRAM page address to the AGP master; in other words, the AGP virtual
address and GART translation are not used.

Use PREFETCHNTA on the linear address to the DMA buffer, before reading or writing the DMA 
buffer. 

If the AGP card hardware is capable of buffering the physical DRAM page addresses sent to the AGP 
card in a FIFO, then in effect the AGP card’s device driver is getting AGP scatter-gather capabilities, 
with cache coherency provided by the processor. 

D.6 Optimizations for Texture-Map Copies to AGP 
Memory 

To avoid cache pollution, use the same technique described in “Fast-Write Optimizations for Video- 
Memory Copies” on page 349 to copy texture data into AGP memory, since this data tends to be 
nontemporal. 

D.7 Optimizations for Vertex-Geometry Copies to AGP 
Memory 

To avoid cache pollution, use the same technique described in “Fast-Write Optimizations for Video- 
Memory Copies” on page 349 to copy vertex data into AGP memory, since this data tends to be 
nontemporal. 
Appendix E SSE and SSE2 Optimizations


This appendix describes specific optimizations that can be utilized to improve performance when 
using SSE and SSE2 instructions on AMD Athlon™ 64 and AMD Opteron™ processors. 

Types of XMM-Register Data 

The XMM registers (used by the SSE and SSE2 instructions) can hold the following three types of 
data: 

• Floating-point single-precision (FPS) 

• Floating-point double-precision (FPD) 

• Integer (INT)

Types of SSE and SSE2 Instructions 

Most SSE and SSE2 instructions can be divided into five types according to the type of data they 
produce and therefore expect to consume: 

• Floating-point single-precision (FPS) 

• Floating-point double-precision (FPD) 

• Integer (INT)

• Load (produces data of type FPS, FPD, or INT)

• Store (can consume a register with data of any type) 

This appendix covers the following topics: 


Topic                                               Page
Half-Register Operations                            356
Zeroing Out an XMM Register                         357
Reuse of Dead Registers                             359
Moving Data Between XMM Registers and GPRs          360
Saving and Restoring Registers of Unknown Format    361
SSE and SSE2 Copy Loops                             362
Data Conversion                                     364
E.1 Half-Register Operations

Optimization 

Take care when mixing data types of operands within the same register. 

Application 

This optimization applies to: 

• 32-bit software 

• 64-bit software 

Rationale 

Mixing data types in a single register is harmless if only scalar operations are used. However, this 
practice can cause performance problems if the register is used as a source for a vector operation.

Example 1 

Avoid code like this: 

addps    xmm1, xmm2   ; Add four packed single-precision (FPS) values in XMM1 
                      ; to their corresponding values in XMM2. 
cvtss2sd xmm1, xmm2   ; Convert the low-order single-precision value in XMM2 
                      ; to 64-bit double-precision FP format and store in 
                      ; lower 64 bits of XMM1. 

In this example, the second instruction leaves the upper half of XMM1 in FPS format and the lower 
half in FPD format. 

Example 2 

Avoid code like this: 

addps  xmm1, xmm2    ; Add four packed single-precision (FPS) values in XMM1 
                     ; to their corresponding values in XMM2. 
movlpd xmm1, mem64   ; Move the double-precision value in mem64 to the lower 
                     ; half of XMM1. 

In this example, the MOVLPD instruction sets the low half of XMM1 to FPD format but leaves the 
high half unchanged (in FPS format). 
E.2 Zeroing Out an XMM Register 

Optimization 

When it is necessary to zero out an XMM register, use an instruction whose format matches the 
format required by the consumers of the zeroed register. 

Application 

This optimization applies to: 

• 32-bit software 

• 64-bit software 

Rationale 

When an XMM register must be set to zero, using the appropriate instruction helps reduce the chance 
of any performance penalty later. 

Table 21 shows the different possible consumers of an XMM register and the corresponding 
instruction that should be used to zero out the register. 

Table 21. Clearing XMM Registers 

Producer of Zero      Example Consumers of Zero 

xorpd xmm1, xmm1      cmppd xmm1, xmm2 
                      cmpsd xmm1, xmm2 
                      comisd xmm1, xmm2 
                      maxpd xmm1, xmm2 
                      maxsd xmm1, xmm2 
                      ucomisd xmm1, xmm2 
                      subsd xmm1, xmm2 

xorps xmm1, xmm1      cmpps xmm1, xmm2 
                      cmpss xmm1, xmm2 
                      comiss xmm1, xmm2 
                      maxps xmm1, xmm2 
                      maxss xmm1, xmm2 
                      ucomiss xmm1, xmm2 
                      subss xmm1, xmm2 

pxor xmm1, xmm1       pcmpxxx xmm1, xmm2 
                      pmaxxx xmm1, xmm2 
                      psubxxx xmm1, xmm2 
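
For C and C++ programmers, the pairing of zero producer and consumer in Table 21 can be sketched with SSE2 intrinsics. This is a minimal illustration, not code from this guide: the helper names are hypothetical, and a compiler is free to pick a different XOR encoding than the one the intrinsic suggests, so inspect the generated assembly if the distinction matters.

```c
#include <emmintrin.h>  /* SSE and SSE2 intrinsics */

/* Hypothetical helper: max(0, x) on two packed doubles.
   The zero is requested in FPD form (_mm_setzero_pd, typically XORPD)
   because its consumer, MAXPD, is an FPD instruction. */
static __m128d clamp_low_pd(__m128d x)
{
    __m128d zero = _mm_setzero_pd();
    return _mm_max_pd(zero, x);
}

/* Hypothetical helper: max(0, x) on four packed floats.
   FPS zero (_mm_setzero_ps, typically XORPS) feeds MAXPS. */
static __m128 clamp_low_ps(__m128 x)
{
    __m128 zero = _mm_setzero_ps();
    return _mm_max_ps(zero, x);
}

/* Hypothetical helper: 0 - x on four packed 32-bit integers.
   INT zero (_mm_setzero_si128, typically PXOR) feeds PSUBD. */
static __m128i negate_epi32(__m128i x)
{
    __m128i zero = _mm_setzero_si128();
    return _mm_sub_epi32(zero, x);
}
```

Each helper requests its zero in the same format as the instruction that consumes it, which is the pairing the table recommends.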
E.3 Reuse of Dead Registers 

Optimization 

When it is necessary to save the contents of a register that is in FPS format to another unused (or 
dead) register, where the previous contents of the dead register are unknown and could be a denormal, 
use movaps xmm1, xmm2 instead of movss xmm1, xmm2. 

Application 

This optimization applies to: 

• 32-bit software 

• 64-bit software 

Rationale 

The movss xmm1, xmm2 instruction takes additional time to execute if the previous contents of 
XMM1 are a denormal. 



Software Optimization Guide for AMD64 Processors 


25112 Rev. 3.06 September 2005 


E.4 Moving Data Between XMM Registers and GPRs 

Optimization 

When a register needs to be spilled, store it in memory rather than moving it to a different register file. 

Application 

This optimization applies to: 

• 32-bit software 

• 64-bit software 

Rationale 

While register moves within a given register file are very efficient (XMM to XMM, GPR to GPR), 
moves between register files (XMM to GPR, GPR to XMM) are not. 
E.5 Saving and Restoring Registers of Unknown Format 

Optimization 

Use INT loads (MOVDQA for 128 bits and MOVQ for 64 bits) when restoring registers of unknown 
format from the stack. 

Application 

This optimization applies to: 

• 32-bit software 

• 64-bit software 

Rationale 

All stores of 64 bits or more from an XMM register to memory may be performed without concern 
for the type of the data in the XMM register. This allows called procedures to save registers on the 
stack without knowing what their format was. Conversely, all INT loads (MOVDQA for 128 bits and 
MOVQ for 64 bits) leave the register in a format that is acceptable to all SSE and SSE2 instructions 
and are therefore recommended when restoring registers of unknown format from the stack. 
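
The save-and-restore pattern can be sketched in C with SSE2 intrinsics. This is an illustrative sketch under stated assumptions: the `xmm_save_slot` type and both helper names are hypothetical, the slot stands in for a 16-byte-aligned stack location, and the instruction the compiler finally emits may differ from the one named in each comment.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical 16-byte-aligned save slot standing in for a stack location. */
typedef struct {
    _Alignas(16) unsigned char bytes[16];
} xmm_save_slot;

/* Save 128 bits without regard to the data type held in the register.
   _mm_store_si128 corresponds to a MOVDQA store. */
static void save_xmm(xmm_save_slot *slot, __m128i reg)
{
    _mm_store_si128((__m128i *)slot->bytes, reg);
}

/* Restore with an INT load (_mm_load_si128, corresponding to MOVDQA),
   as recommended for registers whose original format is unknown. */
static __m128i restore_xmm(const xmm_save_slot *slot)
{
    return _mm_load_si128((const __m128i *)slot->bytes);
}
```

A caller that actually holds floating-point data can pass it through `_mm_castpd_si128` or `_mm_castps_si128` before saving and cast back after restoring; the casts change only the C type and generate no instructions.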
E.6 SSE and SSE2 Copy Loops 

Optimization 

When copying data of an unknown format using the XMM registers, it is best to use INT loads and 
stores. 

Application 

This optimization applies to: 

• 32-bit software 

• 64-bit software 

Rationale 

When using SSE and SSE2 instructions to perform loads and stores, it is best to interleave them in the 
following pattern: Load, Store, Load, Store, Load, Store, etc. 

In 32-bit mode, when using MMX instructions to perform loads and stores, arrange them in 
the following pattern: Load, Load, Store, Store, Load, Load, Store, Store, etc. 

Example 

The following example illustrates a sequence of 128-bit loads and stores: 

movdqa  xmm0, [rdx+r8*8]      ; Load 
movntdq [rcx+r8*8], xmm0      ; Store 
movdqa  xmm1, [rdx+r8*8+16]   ; Load 
movntdq [rcx+r8*8+16], xmm1   ; Store 
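
The same Load, Store, Load, Store ordering can be written in C with SSE2 intrinsics. The sketch below is an assumption-laden illustration rather than code from this guide: the function name `copy_stream_128` is hypothetical, both buffers are assumed to be 16-byte aligned, and the byte count is assumed to be a multiple of 32. The compiler may schedule the instructions differently from the source order.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Hypothetical copy routine: interleaves 128-bit INT loads with
   nontemporal 128-bit stores, two pairs per iteration.
   Assumes dst and src are 16-byte aligned and n_bytes % 32 == 0. */
static void copy_stream_128(void *dst, const void *src, size_t n_bytes)
{
    const __m128i *s = (const __m128i *)src;
    __m128i *d = (__m128i *)dst;
    size_t n_vec = n_bytes / 16;

    for (size_t i = 0; i < n_vec; i += 2) {
        __m128i a = _mm_load_si128(s + i);      /* Load  (MOVDQA)  */
        _mm_stream_si128(d + i, a);             /* Store (MOVNTDQ) */
        __m128i b = _mm_load_si128(s + i + 1);  /* Load  (MOVDQA)  */
        _mm_stream_si128(d + i + 1, b);         /* Store (MOVNTDQ) */
    }
    _mm_sfence();  /* order the nontemporal stores before later stores */
}
```

The INT load and store intrinsics are used because the format of the copied data is unknown, in line with the optimization stated at the start of this section.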
E.7 Explicit Load Instructions 

Optimization 

Use movlpd xmm1, mem64 when loading a scalar FPD value from memory. 

Application 

This optimization applies to: 

• 32-bit software 

• 64-bit software 

Rationale 

The movlpd xmm1, mem64 instruction is more efficient than movsd xmm1, mem64. Use MOVSD only 
if you need to ensure that the upper half of XMM1 is also set to FPD format, perhaps because a vector 
operation is planned on the register. 

When loading a scalar FPS value from memory, use MOVSS. 
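
The difference between the two scalar FPD loads can be seen from C through SSE2 intrinsics. This is a small sketch with hypothetical helper names: `_mm_loadl_pd` corresponds to MOVLPD and replaces only the low half of the register, while `_mm_load_sd` corresponds to MOVSD from memory, which also clears the high half.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* MOVLPD-style load: the low 64 bits come from memory and the
   high 64 bits of 'old' are left unchanged. */
static __m128d load_scalar_keep_high(__m128d old, const double *p)
{
    return _mm_loadl_pd(old, p);
}

/* MOVSD-style load: the low 64 bits come from memory and the
   high 64 bits are set to zero. */
static __m128d load_scalar_zero_high(const double *p)
{
    return _mm_load_sd(p);
}
```

Only the second form writes the whole register, which is why it is the one to use when a vector operation on the register is planned.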
E.8 Data Conversion 

Optimization 

Use care when selecting instructions to convert values from one type to another. 

Application 

This optimization applies to: 

• 32-bit software 

• 64-bit software 

Rationale 

For example, the CVTDQ2PS instruction converts four packed 32-bit signed integer values in an 
XMM register or a 128-bit memory location to four packed single-precision floating-point values and 
writes the converted values to another XMM register. In some cases, an additional instruction is 
recommended to ensure that both halves of register operands are of the same type (as recommended 
in “Zeroing Out an XMM Register” on page 357). 

Table 22 shows the recommendations for register-to-register conversion of scalar values. Table 23 on 
page 365 shows the recommendations for register-to-register conversion of vector operands. When 
converting values directly from memory, use the preferred instructions provided in Table 24 on 
page 365. 


Table 22. Converting Scalar Values 

Source   Destination   Preferred instructions      Notes 
format   format 

FPS      INT XMM       cvtps2dq xmm1, xmm2 

FPS      INT GPR       cvtss2si reg32/64, xmm1 

FPS      FPD           cvtss2sd xmm1, xmm2 

FPD      INT XMM       unpcklpd xmm2, xmm2         UNPCKLPD ensures that the high half of XMM2 
                       cvtpd2dq xmm1, xmm2         is also in FPD format. 

FPD      INT GPR       cvtsd2si reg32/64, xmm1 

FPD      FPS           xorps xmm1, xmm1            XORPS ensures that the high half of XMM1 is 
                       cvtsd2ss xmm1, xmm2         in FPS format in case a MOVAPS instruction 
                                                   is used later. 

INT XMM  FPS           cvtdq2ps xmm1, xmm2 

INT XMM  FPD           cvtdq2pd xmm1, xmm2 

INT GPR  FPS           xorps xmm1, xmm1            XORPS is used to ensure that the high half 
                       cvtsi2ss xmm1, reg32/64     of XMM1 is in FPS format. This is also 
                                                   better in case a MOVAPS instruction is used 
                                                   later. 

INT GPR  FPD           cvtsi2sd xmm1, reg32/64 



Table 23. Converting Vector Values 

Source   Destination   Preferred instructions      Notes 
format   format 

FPS      INT XMM       cvtps2dq xmm1, xmm2 

FPS      FPD           cvtps2pd xmm1, xmm2 

FPD      INT XMM       cvtpd2dq xmm1, xmm2 

FPD      FPS           cvtpd2ps xmm1, xmm2 

INT XMM  FPS           cvtdq2ps xmm1, xmm2 

INT XMM  FPD           cvtdq2pd xmm1, xmm2 



Table 24. Converting Directly from Memory 

Source   Destination   Preferred instructions      Notes 
format   format 

FPD      FPS           xorps xmm1, xmm1            XORPS ensures that the high half of XMM1 is 
                       cvtsd2ss xmm1, mem64        in FPS format in case a MOVAPS instruction 
                                                   is used later. 

INT GPR  FPS           xorps xmm1, xmm1            XORPS is used to ensure that the high half 
                       cvtsi2ss xmm1, mem32/64     of XMM1 is in FPS format. This is also 
                                                   better in case a MOVAPS instruction is used 
                                                   later. 
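
The two-instruction sequences in Table 22 can be expressed from C with SSE2 intrinsics. The sketch below is illustrative only: the helper names are hypothetical, and the instruction named in each comment is the one the intrinsic conventionally maps to, which a compiler is not obliged to emit.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* FPD -> FPS scalar conversion: zero the destination in FPS form
   (XORPS), then convert the low double (CVTSD2SS). The upper three
   single-precision lanes of the result come from the zeroed register. */
static __m128 double_to_float_scalar(__m128d src)
{
    __m128 dst = _mm_setzero_ps();
    return _mm_cvtsd_ss(dst, src);
}

/* INT GPR -> FPS scalar conversion: XORPS followed by CVTSI2SS. */
static __m128 int_to_float_scalar(int value)
{
    __m128 dst = _mm_setzero_ps();
    return _mm_cvtsi32_ss(dst, value);
}

/* FPD -> INT XMM scalar conversion: UNPCKLPD copies the low double
   into the high half so both halves are FPD, then CVTPD2DQ converts. */
static __m128i double_to_int_xmm(__m128d src)
{
    __m128d both = _mm_unpacklo_pd(src, src);
    return _mm_cvtpd_epi32(both);
}
```

In each case the extra instruction exists only to put the whole register into one format before or after the conversion; it does not change the converted value in the low element.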




Index 


Numerics 

3DNow! 210, 215, 217-218, 221, 224, 230, 233 

A 

address-generation interlocks 151 
AMD Athlon™ processor 
microarchitecture 250-251 
AMD Athlon™ system bus 260 
arrays 10 

B 

binary-to-ASCII decimal conversion 181 
boolean operators 17 
branch target buffer (BTB) 126, 253 
branches 

align branch targets 76 
based on comparisons between floats 54 
compound branch conditions 14 
dependent on random data 130 
optimizing density of 126 
prediction 253 

replace with computation in 3DNow! code 136 

C 

C language 14 

array notation versus pointers 10 
C code to 3DNow! code examples 138-140 
structures 39, 117 
cache 

64-byte cache line 116 
CALL and RETURN instructions 132 
ccNUMA 96 

code padding using neutral code fillers 89 
code segment (CS) base, nonzero 135 
const type qualifier 30 

D 

data cache 255 
decoding 254 
DirectPath 

DirectPath over VectorPath instructions 72 
displacements, 8-bit sign-extended 88 
division 160-162, 186 

replace division with multiplication, integer 43, 160 
dynamic memory allocation consideration 19 


E 

extended-precision data 248 

F 

far control-transfer instructions 142 
floating-point 

compare instructions 244 
division and square roots 50 
execution unit 258 
scheduler 257 
to integer conversions 52 
variables and expressions are type float 9 
FXCH instruction 245 

I 

if statement 16, 33 
immediates, 8-bit sign-extended 87 
IMUL instruction 164 
inline functions 149, 170 
inline REP string with low counts 168 
instruction 
cache 252 
control unit 254 
short encodings 80 
integer 

arithmetic, 64-bit 170 
division 43 
execution unit 256 
operand, consider sign 48 
scheduler 256 

use 32-bit data types for integer code 47 

L 

L2 cache controller 259 
LEA instruction 77, 85 
LEAVE instruction 83 
load/store 22, 258 
load-execute instructions 73 
floating-point instructions 74 
integer instructions 73 
local functions 34 
local variables 41, 44 
LOOP instruction 141 
loops 

generic loop hoisting 31 
minimize pointer arithmetic 154 
partial loop unrolling 146 
REP string with low variable counts 168 
unroll small loops 13 
unrolling loops 145 

M 

memory 

dynamic memory allocation 19 
pushing memory data 157 
MMX™ instructions 

PANDN instruction 137 
PREFETCHNTA/T0/T1/T2 instructions 105 
MOVZX and MOVSX instructions 153 
multiplication 
by constant 164 

multiplies over division, floating-point 238 
muxing constructs 136 

N 

Nonuniform Memory Access 96 

O 

operands 

largest possible operand size, repeated string 168 

P 

parallelism 35 
PF2ID instructions 52 
pointers 

dereferenced arguments 44 
use array-style code instead 10 
population-count function 179 
prefetch 

determining distance 108 
multiple 107 

PREFETCH and PREFETCHW instructions 104, 106, 108 
prototypes 29 


R 

recursive functions 132 
register reads and writes, partial 81 
REP prefix 168 

S 

scalar code translated into 3DNow! code 138 
scheduling 144 
SHLD instruction 85 
SHR instruction 85 
single-byte near-return RET instruction (opcode C3h) 128 
SSE 193, 355 
SSE2 193, 355 
stack 

alignment considerations 122 
store-to-load forwarding 20, 22, 100-103 
string instructions 167 
structure (struct) 41, 117, 119 
subexpressions, explicitly extract common 37 
superscalar processor 251 
switch statement 25, 28, 33 

U 

unit-stride access 105, 110 

W 

write combining 113, 260, 263-264, 266 

X 

XOR instruction 169 